In Memoriam

The PKDD community has learned with great sadness that 
Jan Zytkow passed away on Jan. 16, 2001. Jan was a well-known 
researcher in the area of scientific discovery, a pioneer in machine 
learning, an author of many publications and books, an organizer 
of many conferences and meetings, the driving force behind the 
PKDD conferences, a wonderful person, and a friend. Those who 
knew him will miss him. These proceedings are dedicated to Jan. 




Preface 



It is our pleasure to present the proceedings of the 12th European Conference 
on Machine Learning (Lecture Notes in Artificial Intelligence 2167) and the
5th European Conference on Principles and Practice of Knowledge Discovery in 
Databases (this volume). These two conferences were held from September 3-7, 
2001 in Freiburg, Germany, marking the first time - world-wide - that a data 
mining conference has been co-located with a machine learning conference. 

As Program Committee co-chairs of the two conferences, our goal was to 
co-ordinate the submission and reviewing process as much as possible. Here are 
some statistics: a total of 117 papers was submitted to ECML 2001, 78 papers 
were submitted to PKDD 2001, and 45 papers were submitted as joint papers. 
Each paper was carefully reviewed by 3 (in exceptional circumstances 2 or 4) 
members of the Program Committees. Out of the 240 submitted papers, 40 were 
accepted after the first reviewing round, and 54 were accepted on the condition 
that the final paper would meet the requirements of the reviewers. In the end, 90 
papers were accepted for the proceedings (50 for ECML 2001 and 40 for PKDD
2001).

We also aimed to put together a 5-day program that would be attractive, in its
entirety, to both communities. This would encourage participants to stay the
whole week and thus foster interaction and cross-fertilization. The PKDD 2001
conference ran from Monday to Wednesday, and the ECML 2001 conference from
Wednesday to Friday, with the Wednesday program carefully selected to be of
interest to a mixed audience. On each day there was an invited talk by an
internationally renowned scientist. Tom Dietterich spoke on Support Vector
Machines for Reinforcement Learning; Heikki Mannila on Combining Discrete
Algorithmic and Probabilistic Approaches in Data Mining; Antony Unwin on
Statistification or Mystification, the Need for Statistical Thought in Visual
Data Mining; Gerhard Widmer on The Musical Expression Project: A Challenge for
Machine Learning and Knowledge Discovery; and Stefan Wrobel on Scalability,
Search and Sampling: From Smart Algorithms to Active Discovery. In addition,
there was an extensive parallel program of 11 workshops and 8 tutorials. Two
workshops were devoted to results achieved by the participants in the two
learning and mining challenges that were set prior to the conferences.

It has been a great pleasure for us to prepare and organize such a presti- 
gious event, but of course we could not have done it without the help of many 
colleagues. We would like to thank all the authors who submitted papers to 
ECML 2001 and PKDD 2001, the program committee members of both confer- 
ences, the other reviewers, the invited speakers, the workshop organizers, and 
the tutorial speakers. We are particularly grateful to the workshop chairs
Johannes Fürnkranz and Stefan Wrobel; the tutorial chairs Michèle Sebag and
Hannu Toivonen; and the challenge chairs Petr Berka and Christoph Helma
for their assistance in putting together an exciting scientific program. Many
thanks to Michael Keser for his technical support in setting up the CyberChair
website, to Richard van de Stadt for developing CyberChair, and to the local
team at Freiburg for the organizational support provided. We would also like
to thank Alfred Hofmann of Springer-Verlag for his co-operation in publishing
these proceedings. Finally, we gratefully acknowledge the financial support
provided by the sponsors: EU Network of Excellence MLnet II, National Institute
of Environmental Health Sciences (US), SICK AG, the city of Freiburg, and the
Albert-Ludwigs University Freiburg and its Lab for Machine Learning.

Although at the time of writing the event is yet to take place, we are confident 
that history will cast a favorable eye, and we are looking forward to continued 
and intensified integration of the European machine learning and data mining 
communities that we hope has been set in motion with this event. 



July 2001 



Luc De Raedt 
Peter Flach 
Arno Siebes 
Self-Similar Layered Hidden Markov Models



Jafar Adibi and Wei-Min Shen 
{adibi, shen}@isi.edu



Abstract. Hidden Markov Models (HMMs) have proven useful in a variety of
real-world applications where handling uncertainty is crucial. This advantage
can be leveraged further if HMMs can be scaled up to deal with complex
problems. In this paper, we introduce, analyze and demonstrate the Self-Similar
Layered HMM (SSLHMM) for a certain group of complex problems that exhibit the
self-similar property, and we exploit this property to reduce the complexity of
model construction. We show how the embedded knowledge of self-similar
structure can be used to reduce the complexity of learning and increase the
accuracy of the learned model. Moreover, we introduce three different types of
self-similarity in SSLHMM, and investigate their performance in the context of
synthetic data and real-world network databases. We show that SSLHMM has
several advantages compared to conventional HMM techniques and that it is more
efficient and accurate than a one-step, flat method for model construction.



1 Introduction 

There is a vast number of natural structures and physical systems that contain
self-similar structures made through recurrent processes. To name a few: ocean
flows, changes in the yearly flood levels of rivers, voltages across nerve
membranes, musical melodies, human brains, economic markets, Internet web logs
and network data create enormously complex self-similar data [21]. While there
has been much effort on observing self-similar structures in scientific
databases and natural structures, there are few works on using self-similar
structure and fractal dimension for the purpose of data mining and predictive
modeling. Notable among these works are the use of fractal dimension and
self-similarity to reduce the dimensionality curse [21], learning association
rules [2], and applications in spatial join selectivity in databases [9]. In
this paper we introduce a novel technique that uses the self-similar structure
for predictive modeling using a Self-Similar Layered Hidden Markov Model
(SSLHMM).

Despite the broad range of application areas shown for classic HMMs, they do
have limitations and do not easily handle problems with certain
characteristics. For instance, classic HMMs have difficulty modeling complex
problems with large state spaces. Among the recognized limitations, we focus
only on the complexity of HMMs for a certain category of problems with the
following characteristics: 1) The uncertainty and complexity embedded in these
applications make it difficult and impractical to construct the model in one
step. 2) Systems are self-similar, contain self-similar structures and have
been generated through recurrent processes. For instance, analyses of traffic
data from networks and services such as ISDN traffic, Ethernet LANs, Common
Channel Signaling Network (CCSN) and Variable Bit Rate (VBR) video have all
convincingly demonstrated the presence of features such as self-similarity,
long-range dependence, slowly decaying variances, heavy-tailed distributions
and fractal dimensions [24].

L. De Raedt and A. Siebes (Eds.): PKDD 2001, LNAI 2168, pp. 1-15, 2001.

In a companion paper, Adibi and Shen introduced a novel domain-independent
technique to mine sequential databases through Mining by Layered Phases (MLP)
in both discrete and continuous domains [1]. In this paper we introduce a
special form of MLP, the Self-Similar Layered HMM (SSLHMM), for self-similar
structures. We show how SSLHMM uses the information embedded in a self-similar
structure to reduce the complexity of the problem and learn a more accurate
model than a general HMM. Our results are encouraging and show a significant
improvement when self-similar data are modeled through SSLHMM in comparison
with HMM.

The rest of this paper is organized as follows. In Section 2 we review related
work. In Section 3 we introduce SSLHMM, its definition and properties. We
explain the major components of the system and derive the sequence likelihood
for a 2-layer SSLHMM. Section 4 presents current results with an experimental
finding on network data along with discussion and interpretation, followed by
future work and conclusions in Section 5.

2 Related Work 

HMMs have proven tremendously useful as models of stochastic planning and
decision problems. However, the computational difficulty of applying classic
dynamic programming and the limitations of conventional HMMs on realistic
problems have spurred much research into techniques to deal with large state
spaces and complex problems. These approaches include function approximation,
reachability considerations, aggregation techniques and extensions to HMMs. In
the following we refer to those works that are related to our approach in
general or in particular. We categorize these works as extensions to HMMs,
aggregation techniques and segmentation.

Regular HMMs are capable of modeling only one process over time. To overcome
this limitation, several works extend HMMs. Three major extensions are close to
our method. The first was introduced by Ghahramani and Jordan as the Factorial
Hidden Markov Model (FHMM) [12]. This model generalizes the HMM: a state is
factored into multiple state variables and is therefore represented in a
distributed manner. FHMM combines the output of the N HMMs in a single output
signal, such that the output probabilities depend on the N-dimensional
meta-state. As the exact algorithm for this method is intractable, they provide
approximate inference using Gibbs sampling or variational methods. Williams and
Hinton also formulated the problem of learning in HMMs with distributed state
representations [23], which is a particular class of the probabilistic
graphical models of Pearl [16]. The second method, known as the Coupled Hidden
Markov Model (CHMM), consists of modeling the N processes in N HMMs, whose
state probabilities influence one another and whose outputs are separate
signals. Brand, Oliver and Pentland described polynomial-time training methods
and demonstrated the advantage of CHMM over HMM [5].
The last extension to HMM related to our approach was introduced by Vogler and
Metaxas as Parallel Hidden Markov Models (PHMM), which model the parallel
processes independently and can be trained independently [22]. In addition, the
notion of hierarchical HMM (HHMM) has been introduced in [11], in which the
authors extend the conventional Baum-Welch method to hierarchical HMMs. Their
major application is text recognition, in which the segmentation techniques
benefit from the nature of handwriting. The major difference between SSLHMM and
most of the above-mentioned approaches is that they do not consider
self-similarity in the data. SSLHMM uses a recursive learning procedure to find
the optimal solution and makes it possible to use an exact solution rather than
an approximation. In addition, SSLHMM, as a specific case of MLP, uses the
notion of phase, in which the learner considers laziness for the systems, which
goes along with long-range dependence and slowly decaying variances. For a
detailed description of MLP please refer to [1]. In addition, FHMM does not
provide a hierarchical structure and its model is not interpretable, while
SSLHMM is designed toward interpretability. Also, HHMM does not provide the
notion of self-similarity.

In sequential planning, HMMs in general and Partially Observable Markov
Decision Process (POMDP) models specifically have proven useful in a variety of
real-world applications [18]. The computational difficulty of applying dynamic
programming techniques to realistic problems has spurred much research into
techniques to deal with large state and action spaces. These include function
approximation [3] and state aggregation techniques [4, 8]. One general method
for tackling large MDPs is decomposition of a large state model into smaller
models [8, 17]. Dean and Lin [8] and Bertsekas and Tsitsiklis [3] also showed
that some Markov Decision Processes are loosely coupled and hence can be
treated by divide-and-conquer algorithms. The evolution of the model over time
has also been modeled as a semi-Markov Decision Process (SMDP) [18]. Sutton
[20] proposed temporal abstraction, which concatenates sequences of state
transitions to permit reasoning about temporally extended events and forms a
behavioral hierarchy as in [17]. Most of the work in this direction splits a
well-defined problem space into smaller spaces and comes up with subspaces and
intra actions. In contrast, SSLHMM attempts to build a model from given data in
a top-down fashion. Hierarchical HMMs have mostly been employed to divide a
huge state space into smaller spaces or to aggregate actions and decisions. MLP
in general and SSLHMM in particular are orthogonal to state decomposition
approaches.

Complexity reduction has also been investigated through segmentation,
especially in the speech recognition literature. Most of the work is based on
probabilistic networks, Viterbi search over all possible segmentations, and the
use of domain knowledge such as hypothesized segment start and end times
[6, 7, 15]. Segmental HMMs have also been investigated in [13]. Although this
approach fits speech recognition applications, it decomposes a waveform into
local segments, each of which presents a “shape” with additive noise. A
limitation of these approaches in general is that they do not provide a
coherent language for expressing prior knowledge, or for integrating shape cues
at both the local and global level. SSLHMM integrates the prior knowledge into
the infrastructure of the model and as part of the knowledge discovery process.

To our knowledge, the notion of a Self-Similar Layered HMM has not been
introduced before. In addition, the notion of locality and boundary in phases
distinguishes this work from similar approaches.
3 Self-Similar Layered Hidden Markov Model (SSLHMM)

Conventional HMMs are able to model only one process at a time, which is
represented by transitions among the states. Fig. 1(a) shows an HMM with 9
states. An HMM $\lambda$ for discrete symbol observations is characterized by
the following set of definitions: state transition matrix S, observation
distribution matrix B, a set of observations M, a set of states n and initial
distribution $\pi$ [19]. Given a set of observations O and a model $\lambda$,
the well-known problem is to adjust the model parameters to maximize
$P(O \mid \lambda)$.
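
As an illustration of the quantity $P(O \mid \lambda)$ being maximized, the following is a minimal Python sketch of the standard forward procedure for a flat discrete HMM. The function name `forward_likelihood` and the list-of-lists layout for the initial distribution, S and B are assumptions made for this example; they are not taken from the authors' implementation.

```python
def forward_likelihood(pi, S, B, obs):
    """Return P(O | lambda) for a discrete HMM via the forward procedure.

    pi[i]   : initial probability of state i
    S[i][j] : probability of moving from state i to state j
    B[j][k] : probability of emitting symbol k in state j
    obs     : non-empty list of observed symbol indices
    """
    n = len(pi)
    # Initialization: alpha_1(j) = pi_j * b_j(o_1)
    alpha = [pi[j] * B[j][obs[0]] for j in range(n)]
    # Induction: alpha_{t+1}(j) = (sum_i alpha_t(i) * s_ij) * b_j(o_{t+1})
    for o in obs[1:]:
        alpha = [
            sum(alpha[i] * S[i][j] for i in range(n)) * B[j][o]
            for j in range(n)
        ]
    # Termination: P(O | lambda) = sum_i alpha_T(i)
    return sum(alpha)
```

Each time step costs on the order of n^2 multiplications, which is why the number of states matters so much for flat models.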

In the modeling of complex processes, when the number of states grows large,
the maximization process becomes more difficult. A solution provided in the
literature is to use a Layered HMM instead [1, 12]. A Layered HMM has the
capability to model more than one process. Hence, it provides an easier
platform for modeling complex processes. A Layered HMM is a combination of two
or more HMM processes in a hierarchy. Fig. 1(b) shows a Layered HMM with 9
states and 3 super-states, or macro-states (big shaded circles), which we refer
to as phases. As we can see, each phase is a collection of states bound to each
other. The real model transition happens among the states. However, there is
another transition process in the upper layer among phases. The comprehensive
transition model is a function of the transitions among states and the
transitions among phases. A Layered HMM, similar to a conventional HMM, is
characterized by the following set of definitions: a set of observations M, a
set of states n, a set of phases N, state transition matrix S, phase transition
matrix R, observation distribution in each state B, observation distribution in
each phase C, and initial condition for each layer $\pi$. Learning and modeling
follow the well-known Baum-Welch algorithm with some modifications to the
forward and backward algorithms.





Fig. 1. (a) A normal Hidden Markov Model with 9 states, (b) Self-Similar Layered Hidden
Markov Model with 9 states and 3 phases. As shown, each phase contains a similar structure.
A macro point of view suggests that the overall system behavior is more a
trajectory among phases. In particular, the system may go from one phase to
another and stay in each phase for a certain amount of time. From a modeling
point of view, a phase is a set of properties that remain homogeneous through a
set of states of the system and during a period of time. A phase may be
considered a collection of locally connected sets, groups, levels, categories,
objects, states or behaviors. The notion of phase comes with the idea of
granularity, organization and hierarchy. An observed sequence of a system might
be considered a collection of behaviors among phases (rather than a big
collection of states in a flat structure), and it may provide enough
information for reasoning or serve as guidance for further details. Hence, a
sequence with such a property could be modeled through a layered structure. For
example, in the network application domain a phase could be defined as
“congestion” or “stable”. A micro point of view shows that the overall system
behavior is a transition among the states.

SSLHMM is a special form of Layered HMM in which there are some constraints on
the state layer transition, phase layer transition, and observation
distribution. A closer look at Fig. 1(b) shows that this particular Layered HMM
structure is indeed a self-similar structure. As it shows, there is a copy of
the super model (the model consisting of the phases and the transitions among
them) inside each phase. For instance, the probability A of going from phase
III to phase I is equal to the probability a of transition from state 3 to
state 1 in phase III (and in the other phases as well).

The advantage of such a structure is that, as with any other self-similar
model, it is possible to learn the whole model from any part of it. Although a
couple of assumptions are needed for such properties to hold, fortunately
self-similarity is a characteristic of a large group of systems in nature. In
the following, we introduce a self-similar Markovian structure in which the
model shows similar structure across all, or at least a range of, structure
scales.

3.1 Notation 

In the following we describe our notation for the rest of this paper along with
assumptions and definitions. We follow and modify Rabiner's notation [19] for
the discrete HMM. An SSLHMM for discrete observations is characterized by
Table 1.

For simplicity we use $\lambda = (S, B, \pi)$ for the state layer and
$\Lambda = (R, C, \pi)$ for a given phase layer structure. In addition we use
$\Theta = (\lambda, \Lambda, Z)$ for the whole structure of SSLHMM, in which Z
holds the hierarchical information including leaf structure and layer
structure. Even though the states are hidden, in real-world applications there
is a lot of information about the physical problem that points to some
characteristics of a state or phase.
Table 1. Self-Similar Layered Hidden Markov Model parameters and definitions

| Parameter | Definition |
|---|---|
| $N$ | The number of phases in the model. We label individual phases as $\{1, 2, \ldots, N\}$ and denote the phase at time t as $Q_t$. |
| $n$ | The number of states. We label individual states as $\{1, 2, \ldots, n\}$ and denote the state at time t as $q_t$. |
| $M$ | The number of distinct observations. |
| $R = \{r_{IJ}\}$ | Phase layer transition probability, where $r_{IJ} = P(Q_{t+1} = J \mid Q_t = I)$ and $1 \le I, J \le N$. |
| $S = \{s_{ij}\}$ | State layer transition probability, where $s_{ij} = P(q_{t+1} = j \mid q_t = i)$ and $1 \le i, j \le n$. |
| $C = \{c_J(k)\}$ | The observation probability for the phase layer, in which $c_J(k) = P[o_t = v_k \mid Q_t = J]$. |
| $B = \{b_j(k)\}$ | The observation probability for the state layer, in which $b_j(k) = P[o_t = v_k \mid q_t = j]$. |
| $O = \{o_1, o_2, \ldots, o_T\}$ | The observation series. |
| $\pi_{iI} = P[q_1 = i \wedge Q_1 = I]$ | The initial state distribution, in which $1 \le i \le n$ and $1 \le I \le N$. |

3.2 Parameter Estimation 

All equations of the Layered HMM can be derived similarly to the conventional
HMM. However, without loss of generality we only derive the forward algorithm
for a two-layer HMM, as we apply this algorithm to calculate the likelihood in
the next section. In addition, we assume a one-to-one relation among states and
phases for hidden self-similarity. Similar to HMM, we consider the forward
variable $\alpha_t(I, i)$ defined as

$\alpha_t(I, i) = P(o_1, o_2, \ldots, o_t, Q_t = I \wedge q_t = i \mid \Theta)$    (1)

which is the probability of the partial observation sequence
$o_1, o_2, \ldots, o_t$ at time t at state i and phase I, given the model
$\Theta$. Following the Baum-Welch forward procedure we can solve for
$\alpha_t(I, i)$ inductively as follows:



Initialization:

$\alpha_1(J, j) = \pi_{jJ} \, P(o_1 \mid Q_1 = J \wedge q_1 = j)$    (2)
Induction:

$\alpha_{t+1}(J, j) = \left[ \sum_{I=1}^{N} \sum_{i=1}^{n} \alpha_t(I, i) \cdot w_{(I,i)(J,j)} \right] * P(o_{t+1} \mid Q_{t+1} = J \wedge q_{t+1} = j)$    (3)

in which $w_{(I,i)(J,j)}$ is the transition probability from state i in phase I
to state j in phase J. We will show how we calculate this transition matrix in
a simple way.

Termination:

$P(O \mid \Theta) = \sum_{I=1}^{N} \sum_{i=1}^{n} \alpha_T(I, i)$    (4)
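
The forward recursion of equations (2) to (4) can be sketched in Python as follows. This is an illustration under stated assumptions rather than the authors' MATLAB code: the function name, the nested-list layout, and the joint emission table `E[J][j][k]` are hypothetical, the transition term is taken to be the product r_IJ * s_ij, and `n` is simply the size of the state transition matrix passed in.

```python
def layered_forward_likelihood(pi, R, S, E, obs):
    """P(O | Theta) for a two-layer HMM, following equations (2)-(4).

    pi[I][i]   : initial probability of phase I and state i (assumed layout)
    R[I][J]    : phase transition probability r_IJ
    S[i][j]    : state transition probability s_ij
    E[J][j][k] : P(o = k | Q = J and q = j), a hypothetical joint emission table
    obs        : non-empty list of observed symbol indices
    """
    N, n = len(R), len(S)
    # Initialization (2): alpha_1(J, j) = pi(J, j) * P(o_1 | J, j)
    alpha = [[pi[J][j] * E[J][j][obs[0]] for j in range(n)] for J in range(N)]
    # Induction (3), with w_(I,i)(J,j) = r_IJ * s_ij
    for o in obs[1:]:
        alpha = [
            [
                sum(alpha[I][i] * R[I][J] * S[i][j]
                    for I in range(N) for i in range(n)) * E[J][j][o]
                for j in range(n)
            ]
            for J in range(N)
        ]
    # Termination (4): sum alpha_T over all phases and states
    return sum(sum(row) for row in alpha)
```

Per time step this does the same work as a flat HMM over the N * n joint states; the difference lies in the number of free transition parameters, N^2 + n^2 for R and S against (N * n)^2 for an unconstrained joint matrix.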



3.3 Self-Similarity Definition and Conditions 

In geometry, self-similarity comes with the term fractal. Fractals have two interesting 
features. First they are self-similar on multiple scales. Second, fractals have a frac- 
tional dimension, as opposed to an integer dimension that idealized objects or struc- 
tures have. To address self-similarity in Layered HMMs, we define three major types 
of Markovian Self-Similar structures: structural self-similarity, hidden self-similarity 
and strong self-similarity. 

Structural Self-Similarity: Structural self-similarity refers to similarity of
structure in different layers. In our example, if the phase structure
transition is equivalent to the state structure transition, we consider model
$\Theta$ a self-similar HMM. In this case, we will have $r_{IJ} = s_{ij}$ if
$i = I$, $j = J$ and $n = N^2$. This type of self-similarity refers to the
structure of the layers. The scale of self-similarity can go further, depending
on the nature of the problem. It is important to mention that in general, in
modeling via HMM the number of states is preferably kept low to reduce
complexity and increase accuracy. One of the main advantages of SSLHMM, as
described, is that it reduces the number of states dramatically.

Hidden Self-Similarity: Hidden self-similarity refers to similarity of the
observation distribution in different layers. We define hidden self-similarity
as follows. There is a permutation of I, i, $I = \Psi(i)$, for which
$P(o_t \mid \Psi(i)) = P(o_t \mid (\Psi(i), i))$, in which

$P(o_t \mid (\Psi(i), i)) = \sum_{j=1}^{n} P(o_t \mid (\Psi(i), j)) \cdot P(j \mid \Psi(i)) = \sum_{j=1}^{n} P(o_t \mid j) \cdot P(j \mid \Psi(i))$    (5)

In our example, if we assume $\Psi(1) = I$, $\Psi(2) = II$ and
$\Psi(3) = III$, the above-mentioned property for state 1 and phase I will be
as follows:

$P(o_t \mid (1, I)) = P(o_t \mid (I, 1)) P(1 \mid I) + P(o_t \mid (I, 2)) P(2 \mid I) + P(o_t \mid (I, 3)) P(3 \mid I)$    (6)

We refer to this type of self-similarity as hidden because it is not intuitive and it is 
very hard to recognize. 
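
The right-hand side of equation (6) is a weighted mixture of the state-level observation distributions inside a phase. A minimal Python sketch of that mixture follows; the function name `phase_emission` and the argument layout are assumptions made for illustration.

```python
def phase_emission(state_emission, state_given_phase):
    """Mix state-level emission distributions into one phase-level distribution.

    state_emission[j][k] : P(o = k | state j inside the phase)
    state_given_phase[j] : P(j | phase); the weights are expected to sum to 1
    """
    num_symbols = len(state_emission[0])
    return [
        sum(state_given_phase[j] * state_emission[j][k]
            for j in range(len(state_emission)))
        for k in range(num_symbols)
    ]
```

For a phase with three states, as in the example, `state_emission` has three rows and the result is one distribution over the observation symbols.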

Strong Self-Similarity: An SSLHMM $\Theta = (\lambda, \Lambda)$ is strongly
self-similar if the model satisfies the requirements of both structural
self-similarity and hidden self-similarity.



3.4 Assumptions 

In the following we describe our major assumptions, definitions and lemmas used
to rewrite the sequence likelihood.

Decomposability: we assume the layers in a Layered HMM are decomposable. The
probability of occupancy of a given state in a given layer is:

$P[Q_{t+1} = J \wedge q_{t+1} = j] = P[Q_{t+1} = J] * P[q_{t+1} = j \mid Q_{t+1} = J]$    (7)

The decomposability property assumes that the system transition matrix is
decomposable into a phase transition matrix and a state transition matrix.
Under this assumption, the overall transition probability from a given state to
another state is a tensor product of the phase transition and the state
transition. For a multi-layered HMM the overall transition probability is equal
to the tensor product of the HMM transition models. Without loss of generality
we only explain the details of a 2-layer SSLHMM. The transition probability
among states and phases is as follows:

$P[Q_{t+1} = J \wedge q_{t+1} = j \mid Q_t = I \wedge q_t = i] = r_{IJ} \times s_{ij}$    (8)

We denote the tensor product by W:

$W = S \otimes R$ and $w_{(I,i)(J,j)} = r_{IJ} \times s_{ij}$    (9)



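Example: If we consider the transition probabilities for the state layer and
the phase layer as

$S = \begin{pmatrix} .2 & .3 & .5 \\ .7 & .1 & .2 \\ .3 & .1 & .6 \end{pmatrix}$ and $R = \begin{pmatrix} .2 & .3 & .5 \\ .7 & .1 & .2 \\ .3 & .1 & .6 \end{pmatrix}$, we will have:

$W = \begin{pmatrix}
.04 & .06 & .1 & .06 & .09 & .15 & .1 & .15 & .25 \\
.14 & .02 & .04 & .21 & .03 & .06 & .35 & .05 & .1 \\
.06 & .02 & .12 & .09 & .03 & .18 & .15 & .05 & .3 \\
.14 & .21 & .35 & .02 & .03 & .05 & .04 & .06 & .1 \\
.49 & .07 & .14 & .07 & .01 & .02 & .14 & .02 & .04 \\
.21 & .07 & .42 & .03 & .01 & .06 & .06 & .02 & .12 \\
.06 & .09 & .15 & .02 & .03 & .05 & .12 & .18 & .3 \\
.21 & .03 & .06 & .07 & .01 & .02 & .42 & .06 & .12 \\
.09 & .03 & .18 & .03 & .01 & .06 & .18 & .06 & .36
\end{pmatrix}$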
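
The example can be checked numerically. The sketch below reads the example matrices as S = R with rows (.2, .3, .5), (.7, .1, .2) and (.3, .1, .6), which are the values implied by the entries of W, and rebuilds W with a small hand-written Kronecker product. The helper name `kron` and the row ordering (phase index outermost) are assumptions made for this illustration.

```python
def kron(A, B):
    """Kronecker (tensor) product of two matrices given as lists of rows.

    Row I * len(B) + i and column J * len(B[0]) + j hold A[I][J] * B[i][j].
    """
    return [
        [a * b for a in row_a for b in row_b]
        for row_a in A for row_b in B
    ]

S = [[0.2, 0.3, 0.5],
     [0.7, 0.1, 0.2],
     [0.3, 0.1, 0.6]]
R = [row[:] for row in S]  # in this example the phase layer copies the state layer

W = kron(R, S)

# Every row of W should again be a probability distribution.
rows_sum_to_one = all(abs(sum(row) - 1.0) < 1e-12 for row in W)
```

Because S and R are equal in this example, the order of the two factors does not change the resulting matrix.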
Lemma 1: Tensor Product of HMMs: Considering an HMM model
$\lambda = (W, B, \pi)$, it is possible to decompose $\lambda$ into smaller
models if $\exists\, W_1, W_2$ of order $|W_1|$ and $|W_2|$ such that
$|W| = |W_1| \times |W_2|$ and $W = W_1 \otimes W_2$.

Note: Not all HMMs are decomposable to a Tensor product of smaller models. 
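
One simple way to test whether a given row-stochastic W is decomposable in this sense is to marginalize its blocks to obtain candidate factors and then compare their product with W. The sketch below is an illustrative check of that idea, not a procedure described in the paper; the function name `try_decompose`, the (phase, state) index ordering and the tolerance are assumptions.

```python
def try_decompose(W, N, n, tol=1e-9):
    """Try to split a row-stochastic W of order N * n into R (N x N) and S (n x n).

    Rows and columns of W are assumed to be indexed as (phase, state), that is
    row I * n + i and column J * n + j. Returns (R, S) if every entry of W
    equals R[I][J] * S[i][j] up to `tol`, otherwise None.
    """
    # If W[(I,i),(J,j)] = r_IJ * s_ij with S row-stochastic, summing over j gives r_IJ.
    R = [[sum(W[I * n][J * n + j] for j in range(n)) for J in range(N)]
         for I in range(N)]
    # Likewise, summing over J gives s_ij (phase 0 is used as the reference block row).
    S = [[sum(W[i][J * n + j] for J in range(N)) for j in range(n)]
         for i in range(n)]
    # Verify the candidate factors against every entry of W.
    for I in range(N):
        for i in range(n):
            for J in range(N):
                for j in range(n):
                    if abs(W[I * n + i][J * n + j] - R[I][J] * S[i][j]) > tol:
                        return None
    return R, S
```

A return value of None corresponds to the case mentioned in the note: the matrix cannot be written as a tensor product of two smaller stochastic matrices of the requested orders.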

Lemma 2: Markov Property of HMM Tensor Products: If S and R are Markovian
transition matrices then $W = R \otimes S$ is Tensor Markov.

$\sum_{J=1}^{N} r_{IJ} = 1$ for all $I \in R$ and $\sum_{j=1}^{n} s_{ij} = 1$ for all $i \in S$    (10)

$\sum_{J=1}^{N} \sum_{j=1}^{n} w_{(I,i)(J,j)} = \sum_{J=1}^{N} \sum_{j=1}^{n} r_{IJ} \, s_{ij} = \sum_{J=1}^{N} r_{IJ} \sum_{j=1}^{n} s_{ij} = 1$    (11)

Any Tensor Markov model of order $|W_1| \times |W_2|$ is isomorphic to a Markov
model of order $|W| = |W_1| \times |W_2|$.

3.5 Re-writing Sequence Likelihood 

By using the above-mentioned assumptions we can rewrite the sequence likelihood
for a strongly self-similar (one-to-one) HMM as follows. Hidden self-similarity
implies:

$P(o_{t+1} \mid Q_{t+1} = J \wedge q_{t+1} = j) = P(o_{t+1} \mid q_{t+1} = j)$ if $J = j$    (12)

The decomposability assumption along with structural self-similarity makes it
possible to calculate W. Hence equation (3) becomes:

$\alpha_{t+1}(J, j) = \left[ \sum_{I=1}^{N} \sum_{i=1}^{n} \alpha_t(I, i) \cdot w_{(I,i)(J,j)} \right] * P(o_{t+1} \mid q_{t+1} = j)$ if $J = j$    (13)

$\alpha_{t+1}(J, j) = \left[ \sum_{I=1}^{N} \sum_{i=1}^{n} \alpha_t(I, i) \cdot w_{(I,i)(J,j)} \right] * \sum_{j=1}^{n} P(o_t \mid j) \cdot P(j \mid \Psi(i))$ if $J \neq j$

3.6 The Learning Process 

The learning procedure for SSLHMM is similar to that of the traditional HMM via
expectation maximization (EM) [19], except for the calculation of
$\alpha_t(I, i)$ and $\beta_t(I, i)$ as above. We can choose
$\Theta = (\lambda, \Lambda, Z)$ such that its likelihood $P(O \mid \Theta)$ is
locally maximized using an iterative procedure such as the Baum-Welch method.
This procedure iterates between the E step, which fixes the current parameters
and computes posterior probabilities over the hidden states, and the M step,
which uses these probabilities to maximize the expected log likelihood of the
observations. We derived the forward variable $\alpha$ in the last section;
deriving the backward variable $\beta$ is similar.



4 Result 

We have applied the SSLHMM approach to synthetic data and a network-domain
database. Our implementation is in the MATLAB programming language and has been
tested on a Pentium III 450 MHz processor with 384 MB RAM.

4.1 Experiment 1: Synthetic Data 

To compare SSLHMM with HMM, we employed an SSLHMM simulator capable of
simulating discrete and continuous data. In our simulator, a user can define
the number of sequences in the experimental pool, the length of each sequence,
the number of layers, the number of states in each phase, the number of phases,
and the observation set for a discrete environment or a range for continuous
observations. We verified that the synthetic data is indeed self-similar with
H = .6. In this paper we only report the comparison of the Baum-Welch forward
algorithm for an HMM with $n_{HMM}$ states and a 2-layer strong SSLHMM with N
phases and n states. The main purpose of this experiment is built on the
following chain of principles:

Assume there is a sequence of observations O generated by a self-similar
structure.

• We would like to estimate the HMM parameters for such data (n is assumed to
be known in advance).

• We would like to adjust the model parameters $\lambda = (S, B, \pi)$ to
maximize $P(O \mid \lambda)$.

• The model could be either a flat HMM or an SSLHMM.

• We illustrate that for $O = \{o_1, o_2, \ldots, o_T\}$, $P(O \mid SSLHMM)$ is
higher than $P(O \mid HMM)$, the probability of the observations given each
model.

• We also observed that if O is generated not by an SSLHMM but by an HMM,
$P(O \mid SSLHMM) \approx P(O \mid HMM)$. However, due to space limitations we
do not show this result.
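
The simulator itself is not specified in detail here. As a rough idea of how one discrete observation sequence could be drawn from a model whose transition matrix is a tensor product, consider the following Python sketch. The function name `sample_sequence`, the flattened (phase, state) indexing and the argument layout are assumptions for illustration only and do not describe the authors' simulator.

```python
import random

def sample_sequence(pi, W, B, T, rng):
    """Draw T observation symbols from a discrete HMM.

    pi[i]   : initial distribution over the flattened (phase, state) index
    W[i][j] : transition matrix, for example a tensor product of R and S
    B[i][k] : probability of symbol k in flattened state i
    rng     : a random.Random instance, passed in for reproducibility
    """
    states = range(len(pi))
    symbols = range(len(B[0]))
    q = rng.choices(states, weights=pi)[0]
    obs = []
    for _ in range(T):
        # Emit a symbol from the current state, then move to the next state.
        obs.append(rng.choices(symbols, weights=B[q])[0])
        q = rng.choices(states, weights=W[q])[0]
    return obs
```

Calling it repeatedly with the same model and a seeded generator yields a reproducible pool of sequences that can be split into training and test halves.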

We ran a series of tests for a problem consisting of a pre-selected number of
states, up to 15 perceptions and 100 observation sequences for each run. We
assume the number of states and phases are known, so the Baum-Welch algorithm
uses $n_{HMM}$ to build the model and SSLHMM uses N and n (number of phases and
number of states). The assumption of strong self-similarity implies that
$n = N^2$, as we have a copy of the phase structure inside each phase to
represent the state layer. We repeated the whole experiment with a random
distribution for each phase, but in a self-similar fashion, and for a variety
of different n and N. First we trained on 50% of the data and found
P(Model | train) for both HMM and SSLHMM. In the second step we calculated
P(Model | test), where “test” is the remaining 50% of the data. Fig. 2 shows
the -log(likelihood) of the different experiments. A smaller -log(likelihood)
indicates a higher probability.
We ran HMM with the prior number of states equal to 9, 16 and 64, and SSLHMM
with the number of phases equal to 3, 4 and 8 (shown as 3-s, 4-s and 8-s in
Fig. 2). As can be seen, the best SSLHMM model outperforms the best HMM model.
In addition, the average -log(likelihood) of modeling through SSLHMM over all
experiments is lower than that of modeling through HMM by 39%.

[Figure: “Simulation”, Train and Test -log(likelihood) for the HMM and SSLHMM models]

Fig. 2. Negative log likelihood for synthetic data. “x-s” indicates a 2-layer SSLHMM with
x as the number of states in each layer
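
The -log(likelihood) scores compared in this experiment can be computed without numerical underflow by rescaling the forward variable at every step, a standard device for HMMs. The following Python sketch shows the idea for a flat discrete model; the function name `neg_log_likelihood` and the argument layout are assumptions, and this is not the authors' MATLAB code.

```python
import math

def neg_log_likelihood(pi, A, B, sequences):
    """Sum of -log P(O | model) over sequences, using a scaled forward pass.

    pi[i]     : initial state distribution
    A[i][j]   : state transition matrix
    B[j][k]   : probability of symbol k in state j
    sequences : list of non-empty lists of symbol indices
    Assumes every sequence has non-zero probability under the model.
    """
    n = len(pi)
    total = 0.0
    for obs in sequences:
        alpha = [pi[j] * B[j][obs[0]] for j in range(n)]
        for t, o in enumerate(obs):
            if t > 0:
                alpha = [sum(alpha[i] * A[i][j] for i in range(n)) * B[j][o]
                         for j in range(n)]
            scale = sum(alpha)                  # P(o_t | o_1 .. o_{t-1})
            total -= math.log(scale)            # accumulate -log of each factor
            alpha = [a / scale for a in alpha]  # renormalize to avoid underflow
    return total
```

Evaluating this once on the training half and once on the held-out half, for each candidate model, gives two scores per model of the kind plotted as Train and Test bars.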

4.2 Experiment 2: Network Data 

Understanding the nature of traffic in high-speed, high-bandwidth
communications is essential for engineering and performance evaluation. It is
important to know the traffic behavior of some of the expected major
contributors to future high-speed network traffic. There has been a handful of
research and development efforts in this area to analyze LAN traffic data.
Analyses of traffic data from networks and services such as ISDN traffic and
Ethernet LANs have all convincingly demonstrated the presence of features such
as self-similarity, long-term dependence, slowly decaying variance and fractal
dimensions [10, 14].

In this experiment we applied the same principle as in the synthetic data
experiment. A sample of network data was logged by the Spectrum NMP. There are
16 ports $p_n$ on the routers that connect to 16 links, which in turn connect
to 16 Ethernet subnets ($S_n$). Note that traffic has to flow through the
router ports in order to reach the 16 subnets. Thus, we can observe the traffic
that flows through the ports. There are three independent variables:

• Load: a measure of the percentage of bandwidth utilization of a port during a 10 
minute period. 

• Packet Rate: a measure of the rate at which packets are moving through a port 
per minute. 

• Collision Rate: a measure of the number of packets during a 10 minute period that 
have been sent through a port over the link but have collided with other packets. 
Data were collected for 18 weeks, from '94 to '95. There are 16,849 entries, representing
measurements roughly every 10 minutes for 18 weeks. Fig. 3 illustrates an
example of the collected data for port #8.



[Figure 3: time-series plot titled "Collision Rate: Port #8"]

Fig. 3. The number of collisions of port #8. The data show self-similarity over different scales



We applied the HMM and SSLHMM to a given port of the database with the purpose
of modeling the network data. We tested our technique through cross validation:
in each round we trained the model on a random half of the data and tested on the
rest. We repeated the procedure for Load, Packet Rate and Collision Rate on all 16
ports. Fig. 4 illustrates the comparison of HMM and SSLHMM for Load, Packet Rate
and Collision Rate. We ran HMM with the prior number of states equal to 2,
3, 4, 9 and 16, and SSLHMM with the number of phases equal to 2, 3 and 4 (shown as 2-s,
3-s and 4-s in Fig. 4). As Fig. 4 shows, the SSLHMM model with N=4 outperforms
the other competitors in all series of experiments. Our experiment showed that
-log(likelihood) increases dramatically for models with a number of states greater than 16,
as they overfit the data. The best SSLHMM performance beats the best HMM by 23%,
41% and 38% for Collision Rate, Load and Packet Rate respectively.
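
The held-out -log(likelihood) used in these comparisons can be illustrated with a minimal sketch. It assumes a flat, discrete-observation HMM scored with the scaled forward algorithm; the two-state parameters, the random toy series and the contiguous half split are hypothetical illustration choices, not the models or data of the experiments, and no training step is shown.

```python
import numpy as np

def neg_log_likelihood(obs, start, trans, emit):
    """Return -log P(obs | model) for a discrete HMM (scaled forward algorithm)."""
    alpha = start * emit[:, obs[0]]          # joint of first symbol and state
    scale = alpha.sum()
    nll = -np.log(scale)
    alpha = alpha / scale                    # rescale to avoid underflow
    for symbol in obs[1:]:
        alpha = (alpha @ trans) * emit[:, symbol]
        scale = alpha.sum()
        nll -= np.log(scale)
        alpha = alpha / scale
    return nll

# Hypothetical two-state model with two output symbols (illustration only).
start = np.array([0.6, 0.4])
trans = np.array([[0.7, 0.3], [0.4, 0.6]])
emit = np.array([[0.9, 0.1], [0.2, 0.8]])

# Toy series split into two halves; a lower value means a better fit.
rng = np.random.default_rng(0)
series = rng.integers(0, 2, size=200)
half = len(series) // 2
nll_train = neg_log_likelihood(series[:half], start, trans, emit)
nll_test = neg_log_likelihood(series[half:], start, trans, emit)
```

Comparing `nll_test` across candidate models corresponds to the comparison reported in Fig. 4, under the assumptions stated above.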

Our experiments show that the SSLHMM approach behaves properly and does not
perform worse than HMM even when the data is not self-similar or when we do not have
enough information. The SSLHMM provides a more satisfactory model of the network
data from three points of view. First, the time complexity is such that it is possible
to consider models with a large number of states in a hierarchy. Second, this larger
number of states does not require an excessively large number of parameters relative
to the number of states. Learning a certain part of the whole structure is enough to
extend to the rest of the structure. Finally, SSLHMM resulted in significantly better
predictors; the test set likelihood for the best SSLHMM was 100 percent better than
the best HMM.
Fig. 4. The comparison of negative log likelihood for network data for Load, Packet Rate
and Collision Rate. SSLHMM outperforms HMM in all three experiments
While the SSLHMM is clearly a better predictor than the HMM, it is also more easily
interpretable than an HMM. The notion of phase may be considered as a collection of
locally connected sets, groups, levels, categories, objects, states or behaviors,
and it comes with the idea of granularity, organization and
hierarchy. As mentioned before, in the network application domain a phase could be
defined as "congestion" or "stable". This characteristic is the main advantage of
SSLHMM over other approaches such as FHMM [12]. SSLHMM is designed toward
better interpretation, which is one of the main goals of data mining approaches in general.



5 Conclusion and Future Work 

Despite the relatively broad range of application areas, a general HMM cannot
easily scale up to handle a larger number of states. The error of predictive modeling
increases dramatically when the number of states goes up. In this paper we proposed
SSLHMM and illustrated that it is a better estimator than a flat HMM when the data
show the self-similar property. Moreover, we introduced three different types of
self-similarity along with some results on synthetic data and experiments on network data.
Since SSLHMM has a hierarchical structure and abstracts states into phases, it overcomes,
to a certain extent, the difficulty of dealing with a larger number of states at the
same layer, thus making the learning process more efficient and effective.

As future work we would like to extend this research to leverage the MLP power
for precise prediction in both the long term and the short term. In addition we would like to
extend this work to the case where the model shows self-similar structure only over a limited range
of structure scales. Currently we are in the process of incorporating the self-similar property
into Partially Observable Markov Decision Processes (POMDP), along with a generalization
of SSLHMM.
Abstract. This paper investigates a new approach for unsupervised and semi-supervised
learning. We show that this method is an instance of the Classification
EM algorithm in the case of Gaussian densities. Its originality is that it relies
on a discriminant approach whereas classical methods for unsupervised and
semi-supervised learning rely on density estimation. This idea is used to improve
a generic document summarization system; it is evaluated on the Reuters
news-wire corpus and compared to other strategies.



1 Introduction 

Many machine learning approaches for information access require a large amount of 
supervision in the form of labeled training data. This paper discusses the use of unla- 
beled examples for the problem of text summarization. 

Automated summarization dates back to the fifties [12]. The different attempts in 
this field have shown that human-quality text summarization was very complex since 
it encompasses discourse understanding, abstraction, and language generation [25]. 
Simpler approaches were explored which consist in extracting representative text- 
spans, using statistical techniques and/or techniques based on superficial domain- 
independent linguistic analyses. For these approaches, summarization can be defined 
as the selection of a subset of the document sentences which is representative of its 
content. This is typically done by ranking the document sentences and selecting those 
with higher score and with a minimum overlap. Most of the recent work in summariza- 
tion uses this paradigm. Usually, sentences are used as text-span units but paragraphs 
have also been considered [18, 26]. The latter may sometimes appear more appealing 
since they contain more contextual information. Extraction based text summarization 
techniques can operate in two modes: generic summarization, which consists in ab- 
stracting the main ideas of a whole document and query-based summarization, which 
aims at abstracting the information relevant for a given query. 

Our work takes the text-span extraction paradigm. It explores the use of unsupervised
and semi-supervised learning techniques for improving automatic summarization
methods. The proposed model could be used both for generic and query-based summaries.
However, for evaluation purposes we present results on a generic summarization
task. Previous work on the application of machine learning techniques for summarization
[6, 8, 11, 13, 29] relies on the supervised learning paradigm. Such approaches
usually need a training set of documents and associated summaries, which is used to
label the document sentences as relevant or non-relevant for the summary. After training,
these systems operate on unlabeled text by ranking the sentences of a new document
according to their relevance for the summarization task.

The method that we use to make the training of machine learning systems easier
for this task can be interpreted as an instance of the Classification EM algorithm
(CEM) [5, 15] under the hypothesis of Gaussian conditional class densities. However,
instead of estimating conditional densities, it is based on a discriminative approach for
directly estimating posterior class probabilities, and as such it can be used in a
non-parametric context. We present one algorithm based on linear regression in order to
compute posterior class probabilities.

The paper is organized as follows: we first give a brief review of semi-supervised
techniques and recent work in text summarization (Sect. 2). We present the formal
framework of our model and its interpretation as a CEM instance (Sect. 3). We then
describe our approach to text summarization based on sentence segment extraction
(Sect. 4). Finally we present a series of experiments (Sect. 5).



2 Related Work 

Several innovative methods for automated document summarization have been explored
over the last years; they exploit either statistical approaches [4, 26, 31] or linguistic
approaches [9, 14, 22], or combinations of the two [2, 8]. We will focus here
on a statistical approach to the problem and more precisely on the use of machine
learning techniques.

From a machine learning perspective, summarization is typically a task for which
there is a lot of unlabeled data and very few labeled texts, so that semi-supervised
learning seems well suited for the task. Early work on semi-supervised learning dates
back to the 70s. A review of the work done prior to 88 in the context of discriminant
analysis may be found in [15]. Most approaches propose to adapt the EM algorithm
for handling both labeled and unlabeled data and to perform maximum likelihood estimation.
Theoretical work mostly focuses on Gaussian mixtures, but practical algorithms
may be used for more general settings, as soon as the different statistics needed for
EM may be estimated. More recently this idea has attracted the interest of the machine
learning community and many papers now deal with this subject. For example,
[19] propose an algorithm which is a particular case of the general semi-supervised
EM described in [15], and present an empirical evaluation for text classification. [16]
adapt EM to the mixture of experts, and [23] propose a Kernel Discriminant Analysis
which can be used for semi-supervised classification.

The co-training paradigm [3] is also related to semi-supervised training. Our approach
bears similarities with the well-established decision-directed technique, which
has been used for many different applications in the field of adaptive signal processing [7].

For the text summarization task, some authors have proposed to use machine learning
techniques. [11] and [29] consider the problem of sentence extraction as a classification
task. [11] propose a generic summarization model, which is based on a Naive-Bayes
classifier: each sentence is classified as relevant or non-relevant for the summary
and those with the highest scores are selected. Their system uses five features: an indication
of whether or not the sentence length is below a specified threshold, occurrence
of cue words, position of the sentence in the text and in the paragraph, occurrence of
frequent words, and occurrence of words in capital letters, excluding common abbreviations.

[13] has used several machine learning techniques in order to discover features indicating
the salience of a sentence. He addressed the production of generic and user-focused
summaries. Features were divided into three groups: locational, thematic and
cohesion features. The document database was CMP-LG, also used in [29], which
contains human summaries provided by the text authors. The extractive summaries
required for training were automatically generated as follows: the relevance of each
document sentence with respect to the human summary is computed, and the highest-scoring
sentences are retained for building the extractive summary. This model can be considered
both as a generic and a query-based text summarizer.

[6] present an algorithm which generates a summary by extracting sentence seg- 
ments in order to increase the summary concision. Each segment is represented by a 
set of predefined features such as its location, the average term frequencies of words 
occurring in the segment, the number of title words in the segment. Then they compare 
three supervised learning algorithms: C4.5, Naive-Bayes and neural networks. Their 
conclusion is that all three methods successfully completed the task by generating 
reasonable summaries. 



3 Model 

In this section, we introduce an algorithm for performing unsupervised and semi-supervised
learning. This is an iterative method that is reminiscent of the EM algorithm.
At each iteration, it makes use of a regression model for estimating posterior
probabilities that are then used for assigning patterns to classes or clusters. The unsupervised
version of the algorithm may be used for clustering and the semi-supervised
version for classifying. Both versions will be described using a unified framework.
This algorithm can be shown to be an instance of the Classification EM (CEM) algorithm
[5, 15] in the particular case of a Gaussian mixture whose component densities
have equal covariance matrices (section 3.3). In order to show that, we will make use
of some basic results on linear regression and Bayes decision; they are introduced in
section 3.2. For our application, we are interested in two-class classification; we thus
restrict our analysis to the two-class case.



3.1 Theoretical Framework 

We consider a binary decision problem where there are available a set of labeled data
$D_l$ and a set of unlabeled data $D_u$. $D_u$ will always be non-empty, whereas for unsupervised
learning, $D_l$ is empty.

Formally we will note $D_l = \{(x_i, t_i) \mid i = 1, \ldots, n\}$, where $x_i$ is a real-valued feature
vector and $t_i = (t_{1i}, t_{2i})$ is the indicator vector for $x_i$, and $D_u = \{x_i \mid i = n+1, \ldots, n+m\}$.
The latter are assumed to have been drawn
from a mixture of densities with two components $C_1$, $C_2$ in some unknown proportions
$\pi_1$, $\pi_2$. We will consider that unlabeled data have an associated missing indicator vector
$t_i = (t_{1i}, t_{2i})$, $i = n+1, \ldots, n+m$, which is a class or cluster indicator vector.



3.2 Discriminant Functions 

We give below some basic results on the equivalence between Bayes decision func- 
tions and linear regression that will be used for the interpretation of our learning algo- 
rithm as a CEM method. 

3.2.1 Bayesian Decision Rule for Normal Populations 

For two normal populations with a common covariance matrix, $N(\mu_1, \Sigma)$ and $N(\mu_2, \Sigma)$,
the optimal Bayesian discriminant function is [7]:

$$g_B(x) = (\mu_1 - \mu_2)^T \Sigma^{-1} \left(x - \tfrac{1}{2}(\mu_1 + \mu_2)\right) + x_0 \qquad (1)$$

where $x_0$ is a given threshold. The decision rule is to decide $C_1$ if $g_B(x) > 0$ and $C_2$
otherwise.



3.2.2 Linear Regression 

Let $X$ be a matrix whose $i$-th row is the vector $x_i$ and $Y$ be the corresponding vector of
targets whose $i$-th element is $a$ if $x_i \in C_1$ and $b$ if $x_i \in C_2$. For $a$ and $b$ chosen such that
$|C_1| \cdot a + |C_2| \cdot b = 0$, e.g. $a = \frac{|C_1| + |C_2|}{|C_1|}$ and $b = -\frac{|C_1| + |C_2|}{|C_2|}$, the solution to the
minimization of the mean squared error (MSE) $\|Y - XW\|^2$ is:

$$W = \alpha \, \Sigma^{-1} (m_1 - m_2) \qquad (2)$$

The corresponding discriminant function is:

$$g_R(x) = W^T (x - m) = \alpha \, (m_1 - m_2)^T \Sigma^{-1} (x - m) \qquad (3)$$

where $m_k$ and $\Sigma$ respectively denote the mean and the variance of the data for the
partition $C_k$, $\alpha$ is a constant (see e.g. [7]) and $m$ is the mean of all of the samples.
The decision rule is: decide $C_1$ if $g_R(x) > 0$ and otherwise decide $C_2$.

By replacing the mean and covariance matrix in (1) with their plug-in estimates used
in (2), the two decision rules $g_B$ and $g_R$ are similar up to a threshold. The threshold
estimate of the optimal Bayes decision rule can be easily computed from the data so
that the regression estimate could be used for implementing the optimal rule if needed.
For practical applications however, there is no guarantee that the optimal Bayesian rule
will give better results.
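
The equivalence between the regression solution and the plug-in Bayes direction can be checked numerically. The following is a minimal sketch on hypothetical synthetic Gaussian data: it fits least squares on centered inputs with the targets of section 3.2.2 and compares the resulting direction with the pooled within-class scatter solution. The data, the sample sizes and the variable names are assumptions made for illustration only.

```python
import numpy as np

# Hypothetical two-class Gaussian data with a shared covariance matrix.
rng = np.random.default_rng(0)
cov = np.array([[1.0, 0.3], [0.3, 0.5]])
x1 = rng.multivariate_normal([2.0, 0.0], cov, size=60)
x2 = rng.multivariate_normal([0.0, 1.0], cov, size=40)
x = np.vstack([x1, x2])
n1, n2 = len(x1), len(x2)
n = n1 + n2

# Targets a = n/n1 for C1 and b = -n/n2 for C2, so that n1*a + n2*b = 0.
y = np.concatenate([np.full(n1, n / n1), np.full(n2, -n / n2)])

# Least-squares regression on inputs centered at the global mean m.
m = x.mean(axis=0)
w_reg, *_ = np.linalg.lstsq(x - m, y, rcond=None)

# Plug-in discriminant direction: within-class scatter inverse times (m1 - m2).
m1, m2 = x1.mean(axis=0), x2.mean(axis=0)
s_w = (x1 - m1).T @ (x1 - m1) + (x2 - m2).T @ (x2 - m2)
w_lda = np.linalg.solve(s_w, m1 - m2)

# The two directions agree up to a positive constant (cosine close to 1).
cosine = w_reg @ w_lda / (np.linalg.norm(w_reg) * np.linalg.norm(w_lda))
```

Only the direction is compared here; the threshold of the Bayes rule is not reproduced by this sketch.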
3.3 Classification Maximum Likelihood Approach and Classification EM Algorithm

In this section we will introduce the classification maximum likelihood (CML) approach
to clustering [28]. In this unsupervised approach there are $N$ samples generated
via a mixture density:

$$f(x, \Theta) = \sum_{k=1}^{c} \pi_k f_k(x, \theta_k) \qquad (4)$$

where the $f_k$ are parametric densities with unknown parameters $\theta_k$, $c$ is the number
of mixture components and $\pi_k$ is the mixture proportion. The goal here is to cluster the
samples into $c$ components $P_1, \ldots, P_c$. Under the mixture sampling scheme, samples $x_i$
are taken from the mixture density (4), and the CML criterion is [5, 15]:

$$\log L_{CML}(P, \pi, \theta) = \sum_{k=1}^{c} \sum_{i=1}^{N} t_{ki} \log\{\pi_k f_k(x_i, \theta_k)\} \qquad (5)$$

Note that this is different from the mixture maximum likelihood (MML) approach
where we want to optimize the following criterion:

$$\log L_M(P, \pi, \theta) = \sum_{i=1}^{N} \log\left(\sum_{k=1}^{c} \pi_k f_k(x_i, \theta_k)\right) \qquad (6)$$

In the MML approach, the goal is to model the data distribution, whereas in the
CML approach, we are more interested in clustering the data. For CML the mixture
indicator $t_{ki}$ for a given datum $x_i$ is treated as an unknown parameter and corresponds to
a hard decision on the mixture component identity. Many clustering algorithms are
particular cases of CML [5, 24]. Note that CML directly provides a partition of the
data; for MML a partition can be obtained by assigning $x$ to the group with maximal
posterior probability $p(P_k \mid x)$.
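
To make the difference between criteria (5) and (6) concrete, here is a minimal sketch for a one-dimensional mixture of two Gaussians. The data points, the mixture parameters and all function names are hypothetical; the hard labels are obtained by assigning each point to the component with maximal weighted density.

```python
import numpy as np

def gaussian_pdf(x, mean, var):
    """Univariate normal density."""
    return np.exp(-(x - mean) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)

def weighted_densities(x, props, means, variances):
    """Matrix whose entry [i, k] is pi_k * f_k(x_i, theta_k)."""
    return props * gaussian_pdf(x[:, None], means, variances)

def log_l_cml(x, labels, props, means, variances):
    """Criterion (5): each point contributes only through its assigned component."""
    dens = weighted_densities(x, props, means, variances)
    return np.log(dens[np.arange(len(x)), labels]).sum()

def log_l_mml(x, props, means, variances):
    """Criterion (6): each point contributes through the whole mixture."""
    dens = weighted_densities(x, props, means, variances)
    return np.log(dens.sum(axis=1)).sum()

# Hypothetical data and parameters (illustration only).
x = np.array([-2.1, -1.9, -0.2, 1.8, 2.2])
props = np.array([0.5, 0.5])
means = np.array([-2.0, 2.0])
variances = np.array([1.0, 1.0])

hard = weighted_densities(x, props, means, variances).argmax(axis=1)
cml = log_l_cml(x, hard, props, means, variances)
mml = log_l_mml(x, props, means, variances)
```

For any hard assignment the value of (5) cannot exceed that of (6), since a single weighted density is never larger than the sum over all components.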

The classification EM algorithm (CEM) [5, 15] is an iterative technique which has
been proposed for maximizing (5). It is similar to the classical EM except for an additional
C-step where each $x_i$ is assigned to one and only one component of the mixture.
The algorithm is briefly described below.

CEM

Initialization: start from an initial partition $P^{(0)}$

$j$-th iteration, $j \geq 0$:

E-step. Estimate the posterior probability that $x_i$ belongs to $P_k$ ($i = 1, \ldots, N$;
$k = 1, \ldots, c$):

$$E[t_{ki} \mid x_i; P^{(j)}, \pi^{(j)}, \theta^{(j)}] = \frac{\pi_k^{(j)} f_k(x_i, \theta_k^{(j)})}{\sum_{k'=1}^{c} \pi_{k'}^{(j)} f_{k'}(x_i, \theta_{k'}^{(j)})} \qquad (7)$$

C-step. Assign each $x_i$ to the cluster $P_k^{(j+1)}$ with maximal a posteriori probability
according to (7)

M-step. Estimate the new parameters $(\pi^{(j+1)}, \theta^{(j+1)})$ which maximize
$\log L_{CML}(P^{(j+1)}, \pi^{(j)}, \theta^{(j)})$

CML can be easily modified to handle both labeled and unlabeled data; the only
difference is that in (7) the $t_{ki}$ for labeled data are known. (5) becomes:

$$\log L_C(P, \pi, \theta) = \sum_{k=1}^{c} \sum_{x_i \in P_k} \log\{\pi_k f_k(x_i, \theta_k)\} + \sum_{k=1}^{c} \sum_{i=n+1}^{n+m} t_{ki} \log\{\pi_k f_k(x_i, \theta_k)\} \qquad (8)$$

CEM can also be adapted to the case of semi-supervised learning: for maximizing
(8), the $t_{ki}$ for the labeled data are kept fixed, and they are estimated as in the classical CEM
(E and C steps) for the unlabeled data.



3.4 CEM and Linear Regression 

We will show now that CEM could be implemented using a regression approach instead
of the classical density estimation approach. In CEM, parameters $(\pi^{(j)}, \theta^{(j)})$ are
used to compute the posterior probability so as to assign data to the different clusters
in the C-step. However, instead of estimating these probabilities, one could use a
regression approach for directly assigning data to the different clusters during the C-step.

We will show that in the case of two normal populations with equal covariance ma- 
trices, these two approaches are equivalent. 

For a given partition $P^{(j)}$ corresponding to an iteration of the CEM algorithm, suppose
we perform a regression of the input matrix $X$ whose columns are the $x_i$ against
the target vector $Y$ whose $i$-th row is $a$ if $x_i \in P_1^{(j)}$ and $b$ if $x_i \in P_2^{(j)}$, with $a$ and $b$ as
described in section 3.2.2. In this case, the decision rule inferred from the regression
estimation, together with an appropriate threshold, will be the optimal Bayes decision
rule with plug-in maximum likelihood estimates. Using this decision rule derived from
the linear regression, we could then assign the data to the different clusters and the
partition $P^{(j+1)}$ will be exactly the same as the one obtained from the classical mixture
density version of CEM.

Therefore the E-step in the CEM algorithm may be replaced by a regression step 
and the decision obtained for the C-step will be unchanged. 

Because we are interested here only in classification or clustering, if we use the regression
approach, the M-step is no longer necessary. The regression CEM algorithm
starts from an initial partition; the $j$-th step consists in classifying unlabeled data
according to the decision rule inferred from the regression. It can easily be proved that
this EM algorithm converges to a local maximum of the likelihood function (5) for
unsupervised training and (8) for semi-supervised training. The algorithm is summarized
below:

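Regression-CEM

Initialization: start from an initial partition $P^{(0)}$

$j$-th iteration, $j \geq 0$:

E-step: compute $W^{(j)} = \alpha \, (\Sigma^{(j)})^{-1} (m_1^{(j)} - m_2^{(j)})$

C-step: classify the $x_i$s according to the sign of $(W^{(j)})^T x_i + w_0^{(j)}$ into $P_1^{(j+1)}$ or $P_2^{(j+1)}$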

In the above algorithm, posterior probabilities are directly estimated via a discriminant
approach. All other approaches we know of for unsupervised or semi-supervised
learning rely on generative models and density estimation. This is an original aspect of
our method and we believe that this may have important consequences. In practice for
classification, direct discriminant approaches are usually far more efficient and more
robust than density estimation approaches and make it possible, for example, to reach the same
performance with fewer labeled data. A more attractive reason for applying
this technique is that for non-linearly separable data, more sophisticated regression-based
classifiers such as non-linear Neural Networks or non-linear Support Vector
Machines may be used instead of the linear classifier proposed above. Of course, for
such cases, the theoretical equivalence with the optimal decision rule is lost.

Regression CEM can be used both for unsupervised and semi supervised learning. 
For the former, the whole partition is re-estimated at each iteration, for the latter, tar- 
gets of labeled data are kept fixed during all iterations and only unlabelled data are 
reassigned at each iteration. In the unsupervised case, the results will heavily depend 
on the initial partition of the data. 

For our text summarization application we have performed experiments for both 
cases. 
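
The Regression-CEM loop can be sketched in a few lines. The version below is a minimal illustration under stated assumptions: classes are coded 0 and 1, the regression is an ordinary least-squares fit on inputs centered at the global mean with the targets of section 3.2.2, and the decision uses the sign of the centered regression output rather than the Bayes threshold. The function name, the `fixed` mask used for the semi-supervised case and the toy one-dimensional data are hypothetical.

```python
import numpy as np

def regression_cem(x, init_labels, fixed=None, max_iter=50):
    """Iterate regression (E-step) and reassignment (C-step) until the partition is stable.

    x: array of shape (n, d); init_labels: initial partition with values in {0, 1};
    fixed: optional boolean mask of labeled points whose labels are never changed.
    """
    labels = init_labels.copy()
    if fixed is None:
        fixed = np.zeros(len(x), dtype=bool)
    for _ in range(max_iter):
        n1, n2 = (labels == 0).sum(), (labels == 1).sum()
        if n1 == 0 or n2 == 0:
            break                                   # degenerate partition, stop
        n = n1 + n2
        targets = np.where(labels == 0, n / n1, -n / n2)
        m = x.mean(axis=0)
        w, *_ = np.linalg.lstsq(x - m, targets, rcond=None)
        scores = (x - m) @ w
        new_labels = np.where(scores > 0, 0, 1)     # sign of the regression output
        new_labels[fixed] = init_labels[fixed]      # labeled data keep their targets
        if np.array_equal(new_labels, labels):
            break                                   # partition unchanged: converged
        labels = new_labels
    return labels

# Toy one-dimensional data: two groups, with two points wrongly placed initially.
x = np.array([-3.2, -3.1, -3.0, -2.9, -2.8, -2.7,
              2.7, 2.8, 2.9, 3.0, 3.1, 3.2]).reshape(-1, 1)
init = np.array([1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1])
labels = regression_cem(x, init)
```

Passing a `fixed` mask reproduces the semi-supervised variant, in which only the unlabeled points are reassigned at each iteration.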

4 Automatic Text Summary System 

4.1 A Base Line System for Sentence Classification 

Many systems for sentence extraction have been proposed which use similarity meas- 
ures between text spans (sentences or paragraphs) and queries, e.g. [8, 13]. Represen- 
tative sentences are then selected by comparing the sentence score for a given docu- 
ment to a preset threshold. The main difference between these systems is the represen- 
tation of textual information and the similarity measures they are using. Usually, 
statistical and/or linguistic characteristics are used in order to encode the text 
(sentences and queries) into a fixed size vector and simple similarities (e.g. cosine) are 
then computed. 

We will build here on the work of [10], who used such a technique for the extraction
of sentences relevant to a given query. They use a tf-idf representation and compute
the similarity between sentence $s_k$ and query $q$ as:

$$Sim_1(q, s_k) = \sum_{w_i \in s_k, q} tf(w_i, q) \cdot tf(w_i, s_k) \cdot \left(1 - \frac{\log(df(w_i) + 1)}{\log(n + 1)}\right)^2 \qquad (9)$$

where $tf(w, x)$ is the frequency of term $w$ in $x$ ($q$ or $s_k$), $df(w)$ is the document frequency
of term $w$ and $n$ is the total number of documents in the collection. Sentence $s_k$
and query $q$ are pre-processed by removing stop-words and performing Porter
reduction on the remaining words. For each document a threshold is then estimated
from data for selecting the most relevant sentences.
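
A query-sentence similarity of this kind can be sketched as follows. The sketch assumes that stop-word removal and stemming have already been applied to the token lists, and that the term weight is the inverse-document-frequency factor $(1 - \log(df(w)+1)/\log(n+1))$, squared here as an assumption. The function name, the document-frequency table and the example tokens are hypothetical.

```python
import math
from collections import Counter

def sim1(query_terms, sentence_terms, doc_freq, n_docs):
    """Sum over shared terms of tf(w, q) * tf(w, s) * idf(w)**2."""
    tf_q = Counter(query_terms)
    tf_s = Counter(sentence_terms)
    score = 0.0
    for w in tf_q.keys() & tf_s.keys():            # terms present in both q and s
        idf = 1.0 - math.log(doc_freq.get(w, 0) + 1) / math.log(n_docs + 1)
        score += tf_q[w] * tf_s[w] * idf ** 2
    return score

# Hypothetical collection statistics: 99 documents, "price" occurs in all of them.
doc_freq = {"oil": 9, "price": 99}
score = sim1(["oil", "price"], ["oil", "price", "oil", "rise"], doc_freq, n_docs=99)
```

A term that occurs in every document receives a zero weight, so only "oil" contributes to the score in this example.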

Our approach for the sentence extraction step is a variation of the above method 
where the query is enriched before computing the similarity. Since queries and sen- 
tences may be very short, this allows computing more meaningful similarities. Query 
expansion - via user feedback or via pseudo relevance feedback - has been success- 
fully used for years in Information Retrieval (IR) e.g. [30]. The query expansion pro- 
ceeds in two steps: first the query is expanded via a similarity thesaurus - WordNet in 
our experiments - second, relevant sentences are extracted from the document and the 
most frequent words in these sentences are included into the query. This process can 
be iterated. The similarity we consider is then: 



$$Sim_2(q, s_k) = \sum_{w_i \in s_k, q} \overline{tf}(w_i, q) \cdot tf(w_i, s_k) \cdot \left(1 - \frac{\log(df(w_i) + 1)}{\log(n + 1)}\right)^2 \qquad (10)$$

where $\overline{tf}(w_i, q)$ is the number of terms within the "semantic" class of $w_i$ in the
query $q$. This extraction system will be used as a baseline system for evaluating the
impact of learning throughout the paper. Although it is basic, similar systems have
been shown to perform well for sentence-extraction-based text summarization. For
example, [31] uses such an approach, which operates only on word frequencies for
sentence extraction in the context of generic summaries, and shows that it compares
well with human-based sentence extraction.



4.2 Learning 

We propose below a technique which takes into account the coherence of the whole
set of relevant sentences for the summaries and significantly increases the
quality of the extracted sentences.

4.2.1 Features 

We define new features in order to train our system for sentence classification. A 
sentence is considered as a sequence of terms, each of them being characterized by a 
set of features. The sentence representation will then be the corresponding sequence of 
these features. 

We used four values for characterizing each term $w$ of sentence $s$: $tf(w, s)$, $\overline{tf}(w, q)$,
$(1 - \log(df(w) + 1)/\log(n + 1))$ and $Sim_2(q, s)$, computed as in (10), the similarity between
$q$ and $s$. The first three variables are frequency statistics which give the importance of
a term for characterizing respectively the sentence, the query and the document. The
last one gives the importance of the sentence containing $w$ for the summary and is
used in place of the term importance since it is difficult to provide a meaningful measure
for isolated terms [10].

4.2.2 The Learning Text Summary System 

In order to provide an initial partition for the semi-supervised learning we have
labeled 10% of the sentences in the training set using the news-wire summaries as the
correct set of sentences. For the unsupervised learning we have used the baseline
system's decision. We then train a linear classifier with a sigmoid output function to
label all the sentences from the training set, and iterate according to the Regression-CEM
algorithm.



5 Experiments 

5.1 Data Base 

A corpus of documents with the corresponding summaries is required for the evaluation.
We have used the Reuters data set consisting of news-wire summaries [20]: this
corpus is composed of 1000 documents and their associated extracted sentence summaries.
The data set was split into a training and a test set. Since the evaluation is
performed for a generic summarization task, a query was generated by collecting the most
frequent words in the training set. Statistics about the data set collection and summaries
are shown in Table 1.

5.2 Results 

Evaluation issues of summarization systems have been the object of several attempts,
many of them being carried out within the TIPSTER program [21] and the SUMMAC
competition [27].



Table 1. Characteristics of the Reuters data set and of the corresponding summaries.

Collection                                     Training    Test     All
# of docs                                           300     700    1000
Average # of sentences/doc                        26.18   22.29   23.46
Min sentence/doc                                      7       5       5
Max sentence/doc                                     87      88      88
News-wire summaries
Average # of sentences/sum                         4.94    4.01     4.3
% of summaries including sentence of docs          63.3    73.5    70.6

This is a complex issue and many different aspects have to be considered simultaneously
in order to evaluate and compare different summarizers [17].

Our methods provide a set of relevant document sentences. Taking all the selected 
sentences, we can build an extract for the document. For the evaluation, we compared 
this extract with the news-wire summary and used Precision and Recall measures, 
defined as follows: 



$$\text{Precision} = \frac{\#\ \text{of sentences extracted by the system which are in the news-wire summaries}}{\text{total}\ \#\ \text{of sentences extracted by the system}}$$

$$\text{Recall} = \frac{\#\ \text{of sentences extracted by the system which are in the news-wire summaries}}{\text{total}\ \#\ \text{of sentences in the news-wire summaries}}$$


We give below the average precision (Table 2) for the different systems and the
precision/recall curves (Figure 1). The baseline system gives bottom-line performance,
which allows evaluating the contribution of our training strategy. In order to provide
an upper bound of the expected performances, we have also trained a classifier in a
fully supervised way, by labeling all the training set sentences using the news-wire
summaries.

Unsupervised and semi-supervised learning provide a clear increase in performance
(up to 9%). If we compare these results to fully supervised learning, which is
a further 9% better, we can infer that with 10% of labeled data, we have been able to
extract from the unlabeled data half of the information needed for this "optimal"
classification.



Table 2. Comparison between the baseline system and different learning schemes, using a linear
sigmoid classifier. Performances are on the test set.

System                        Precision (%)   Total Average (%)
Baseline system                       54.94               56.33
Supervised learning                   72.68               74.06
Semi-supervised learning              63.94               65.32
Unsupervised learning                 63.53               64.92



We have also compared the linear Neural Network model to a linear SVM model in
the case of unsupervised learning, as shown in Table 3. The two models performed
similarly; both are linear classifiers, although their training criteria are slightly different.

Table 3. Comparison between two different linear models: Neural Networks and SVM in the
case of Self-supervised learning. Performances are on the test set.

System                                            Precision (%)   Total Average (%)
Self-Supervised learning with Neural-Networks             63.53               64.92
Self-Supervised learning with SVM                         62.15               63.55
11-point precision-recall curves allow a more precise evaluation of the system behavior.
For the test set, let $M$ be the total number of sentences extracted by the
system as relevant (correctly or incorrectly), $N_s$ the total number of sentences extracted
by the system which are in the newswire summaries, $N_g$ the total number of
sentences in newswire summaries and $N_t$ the total number of sentences in the test set.

Precision and recall are computed respectively as $N_s/M$ and $N_s/N_g$. For a given
document, sentences are ranked according to the decision of the classifier. Precision
and recall are computed for $M = 1, \ldots, N_t$ and plotted here one against the other as an
11-point curve. The curves illustrate the same behavior as Table 2: semi-supervised and
unsupervised learning behave similarly, and for all recall values their performance increase is
half that of the fully supervised system. Unsupervised learning appears as a very
promising technique since no labeling is required at all. Note that this method could
be applied as well and exactly in the same way for query-based summaries.
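
An 11-point curve of this kind can be sketched from a ranked list of sentences. The version below assumes the usual interpolation convention, which is not spelled out in the text: precision at each recall level 0.0, 0.1, ..., 1.0 is taken as the maximum precision observed at any cut-off whose recall reaches that level. The function name and the example ranking are hypothetical.

```python
def eleven_point_curve(ranked_relevance):
    """Interpolated precision at recall levels 0.0, 0.1, ..., 1.0.

    ranked_relevance: list of booleans, best-scored sentence first, True when the
    sentence belongs to the reference summary.
    """
    total_relevant = sum(ranked_relevance)
    if total_relevant == 0:
        raise ValueError("no relevant sentences in the ranking")
    points = []                                    # (recall, precision) per cut-off M
    hits = 0
    for m, relevant in enumerate(ranked_relevance, start=1):
        hits += relevant
        points.append((hits / total_relevant, hits / m))
    curve = []
    for level in [i / 10 for i in range(11)]:
        candidates = [p for r, p in points if r >= level - 1e-12]
        curve.append(max(candidates) if candidates else 0.0)
    return curve

# Hypothetical ranking: relevant sentences at ranks 1 and 3 out of 4.
curve = eleven_point_curve([True, False, True, False])
```

The returned list contains one precision value per recall level and can be plotted directly against the levels 0.0 to 1.0.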




[Figure 1: precision-recall plot; horizontal axis: Recall]

Fig. 1. Precision-Recall curves for the baseline system (square), unsupervised learning (star),
semi-supervised learning (triangle) and supervised learning (circle).



6 Conclusion 

We have described a text summarization system in the context of sentence-based
extraction summaries. The main idea proposed here is the development of a fully
automatic summarization system using an unsupervised and semi-supervised learning
paradigm. This has been implemented using simple linear classifiers, and
experiments on Reuters newswire have shown a clear performance increase.
Unsupervised learning achieves half of the performance increase obtained by a
fully supervised system, and is much more realistic for applications. It can
also be used in exactly the same
way for query-based summaries.

Abstract. In this paper, we discuss an approach for discovering temporal
changes in event sequences, and present first results from a study on
demographic data. The data encode characteristic events in a person’s
life course, such as their birth date, the start and end dates of their
partnerships and marriages, and the birth dates of their children. The
goal is to detect significant changes in the chronology of these events
over people from different birth cohorts. To solve this problem, we
encoded the temporal information in a first-order logic representation, and
employed Warmr, an ILP system that discovers association rules in a
multi-relational data set, to detect frequent patterns that show significant
variance over different birth cohorts. As a case study in multi-relational
association rule mining, this work illustrates the flexibility resulting from
the use of first-order background knowledge, but also uncovers a number
of important issues that hitherto received little attention.



1 Introduction 

In this paper, we study the problem of discovering patterns that exhibit a signif- 
icant change in their relative frequency of occurrence over time. As was already 
argued by in many domains the step beyond discovering frequent item sets 
to the discovery of second-order phenomena like the temporal change in these 
frequencies is of crucial importance. 

The analysis of life courses is such a problem. In the social sciences, and
especially in demography and sociology, there has been a diffusion of the so-called
life course approach [IS|. One of the principal interests in that approach is the
study of how the lives of humans change as far as the age and the sequencing
and the number of crucial events are concerned. To study the evolution of a
whole society, it is common to analyze successive cohorts of people, i.e., groups
of people that were born in the same period of time (e.g. the same year or the
same decade).

Previous approaches to detecting change mostly propose special-purpose
algorithms that have to treat time as a special type of variable i™ . Instead, we
propose to address this problem by exploiting the power of a general,
multi-dimensional data mining system. The system that we use, Warmr 0, is based
on the level-wise search of conventional association rule learning systems of the
Apriori family mm. It extends these systems by looking for frequent patterns
that may be expressed as conjunctions of first-order literals. This expressive
power makes it fairly easy to encode temporal relationships. In fact, the system
does not need to discriminate between temporal relations and other
domain-dependent relations, as conventional solutions to sequence discovery PCSI
typically do, and lets them co-exist in a natural and straightforward way.

2 The Dataset 

The data for our analysis originate from the Austrian Fertility and Family Survey 
(FFS), which was conducted between December 1995 and May 1996. In the 
survey, retrospective histories of partnerships, births, employment and education 
were collected for 4,581 women and 1,539 men between ages 20 and 54. Hence, the 
Austrian FFS covers birth cohorts from 1941 to 1976. The retrospective histories 
of partnerships and fertility for each respondent allow us to determine the timing 
of all births in the current and any previous union. Moreover, information about 
the civil status of each partnership in any month of observation is available, 
which allows us to determine whether a union started as a marriage or whether 
it was transformed into a marriage later on. 

We are interested in studying the main features that discriminate between 
the life courses of older and younger cohorts as far as the number of children, 
number and type of unions, fertility before and after unions, etc. are concerned. 
The present study should be considered a first step in that direction.

3 Multi-relational Data Mining — Using Warmr for 
Discovering Temporal Changes 

The dataset under consideration is essentially a multi-relational dataset: a per- 
son’s life course is not described with a single tuple but by a set of tuples. Most 
common data mining algorithms expect the data to reside in a single table, and 
when the data are actually stored in a database with multiple tables they have 
to be preprocessed: one needs to derive a single table from the original data 
that hopefully retains as much of the original information as possible. This is 
a non-trivial task. Two directions are possible: one is to devise automatic pre- 
processing methods, another is to use data mining methods that can handle a 
multi-relational database, e.g., inductive logic programming (ILP) methods [TOj . 
Our approach falls into the second category as we employ the ILP system Warmr 
0 to detect interesting patterns.

female(159).                      subject(I) :- female(I),
birth_date(159,6001).                 birth_date(I,X),
                                      X >= 4100, X =< 6000.
children(159,2).
child_birth_date(159,7810).       gen40(I) :- birth_date(I,X),
child_birth_date(159,8706).           X >= 4100, X < 4600.

unions(159,2).                    child_out_of_wedlock_at_age(I,A) :-
union(159,7805,8106).                 birth_date(I,X),
union(159,8306,-2).                   child_birth_date(I,M),
                                      \+ (marriage_date(I,N), N =< M),
marriage_date(159,7807).              A is (M-X)/100, A =< 35.
marriage_date(159,8706).



Fig. 1. On the left, a Prolog encoding of a typical entry in the database is shown. This
snapshot represents a female person with id 159 who was born in January 1960. Up
to the interview date (≈ December 1995), she had formed two unions. The first lasted
from May 1978 to June 1981, and the second started in June 1983 (and had not ended
at the time of the interview). Both unions were converted to marriages (July 1978 and
June 1987) and in each union one child was born (October 1978 and June 1987).

The right half shows a few predicate definitions that operate on the basic data
encoding. The subject/1 predicate served as a key, and could be used to filter the data (in
this case to admit only persons from cohorts 1941-60). gen40/1 encodes one of the class
values (the cohort 1941-1945), and child_out_of_wedlock_at_age/2 shows an example
of abstract background knowledge that can be defined upon the base predicates, as
well as an example of censoring events above a certain age.
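
The date codes in Fig. 1 appear to follow a YYMM scheme (the caption reads 6001 as January 1960), with a negative value such as -2 standing for an event that has not occurred. Under that reading, which is inferred from the caption rather than stated by the authors, the codes can be decoded as in the following Python sketch. The function names are hypothetical.

```python
def decode_date(code):
    """Split a YYMM code such as 7805 into (year, month).

    Assumption: all dates fall in the 20th century, and negative
    codes (e.g. -2) mean "no date", as for a union that has not ended.
    """
    if code < 0:
        return None
    year, month = divmod(code, 100)
    return 1900 + year, month


def age_at(birth_code, event_code):
    """Age as computed in Fig. 1: (M - X) / 100."""
    return (event_code - birth_code) / 100
```

For person 159, `decode_date(7805)` gives `(1978, 5)`, matching the start of the first union in the caption. The integer part of `age_at` is the age in completed years (ignoring the day of the month); the fractional part is not a decimal fraction of a year.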

As multi-relational data mining is a relatively unexplored field, few guidelines 
exist as to the methodology that should be followed. Consequently, this section 
describes the complete data mining process that we employed for addressing our 
problem. On the way, we will also describe the workings of Warmr, the first-order 
data mining system that we used, as well as the way the data was represented 
for the use by this system. Although the following sections will only describe 
the final data representation, it should be noted that this was not the result of 
a top-down design, but of repeated exploration of several options. 

3.1 Data Preprocessing 

Data preprocessing consisted of several steps. First, the data were converted into
a Prolog format. A typical entry is shown in the left half of Fig. 1. From then
on, preliminary experiments could be run, which were helpful to further improve
the quality of the domain representation.

To this base representation, we added several pieces of background knowledge
in the form of Prolog predicates that operate upon these operational predicates.
These were mostly conversions from dates to ages (which are needed when one
wants to find, e.g., changes in the average age of marrying), but also included
complex high-level predicates such as child_out_of_wedlock_at_age/2, which
encodes whether a person had a child out of wedlock. Most importantly, we
included the </2 predicate, which allowed the system to compare the ages at which
events take place in people’s life courses, and thus to describe their chronology.
Note that this approach finds not only sequences in the strict sense,
but also partially ordered events, such as A<B, A<C, where the order of the
events B and C is left open (both A<B<C and A<C<B are possible). It is precisely
this facility which makes the use of a relational data mining system necessary,
because this functionality cannot be achieved with a conventional, propositional
association rule finder without a pre-processing phase that encodes all
possible event sequences as separate features. Note that such a procedure would
basically reduce the task of frequent pattern discovery to a single pass over
all possible, pre-compiled sequences, which defeats the purpose of the efficient,
level-wise search.

Fig. 2. Relative frequencies of people that had a union before they married. The
steady increase up to 1960 is a true regularity, while the decline for the cohorts
1960-1975 is due to the fact that for increasingly many individuals of these
cohorts, union formation and marriage have not yet taken place in their lives at the
time of the interview (≈ December 1995).

In addition, the flexibility of the first-order background knowledge also
facilitated censoring of the data. In the social sciences, this term describes the
situation that, when using data collected from interviews where people are asked about
their past experience of events, only their life courses up to the age at interview
are available to the analyst. Such a situation can, e.g., be seen in the decline of
frequencies of people that formed a union before they were married (Fig. 2). As
people born after 1965 were, at the time of the interview (≈ December 1995),
30 years or younger, it is quite natural that the probability for these people to
have experienced certain events or event sequences in their life course is not the
same as for people in their forties or older. Consequently, in preliminary
experiments we discovered many rules that merely reflected the comparably low
frequencies of marriage, union formation, and child birth for people born in the seventies.

After some experimentation, we decided to censor items in the following way: 

— only people born in the forties or fifties were retained 

— events in people’s lives that occurred after the age of 35 were ignored 

Censoring was easily done by adding inequalities to the relevant definitions of
background knowledge (as in the last rule of Fig. 1). Naturally, looking only at
events happening before the age of 35 severely limits our study, but increasing
this age limit could only be done at the expense of reducing the data set (in
order to increase the age limit to 40 and yet avoid artificial patterns, the set
of subjects would need to be reduced to those people born before 1956). Our
choice is a compromise between keeping as much data as possible and looking
at a reasonable set of events in people’s lives.
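
The two censoring rules can be illustrated with a short Python sketch. It is not the authors' implementation (they censor inside the Prolog background knowledge); the record layout, the function name, and the birth window passed in the usage note are assumptions made for illustration, with ages computed as in Fig. 1.

```python
def censor(people, first_birth, last_birth, max_age=35):
    """Apply both censoring rules to a dict of records.

    people: {id: {"birth": YYMM, "events": [(name, YYMM), ...]}}
    (hypothetical layout). Subjects born outside the window are
    dropped; events after max_age are removed from the rest.
    """
    kept = {}
    for pid, rec in people.items():
        if not (first_birth <= rec["birth"] <= last_birth):
            continue  # cohort outside the study window
        events = [(name, code) for name, code in rec["events"]
                  if (code - rec["birth"]) / 100 <= max_age]
        kept[pid] = {"birth": rec["birth"], "events": events}
    return kept
```

Calling `censor(people, 4101, 6012)` would keep the cohorts 1941-60 named in the caption of Fig. 1; raising `max_age` to 40 would require a correspondingly earlier `last_birth`, which is the trade-off discussed above.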

The original dataset contained 6120 entries, 1539 representing male subjects
and 4581 representing female subjects. As the distribution of male and female
subjects is obviously skewed, we decided to work only with the larger, female
group.^1 Again, this could easily be achieved by adding an appropriate condition
in subject/1 (see Fig. 1), which defines the key (as defined below).

3.2 Discovery of Frequent Patterns 

Warmr |SI is an ILP system that discovers frequent patterns in a data set, where 
a “pattern” is defined as a conjunction of first order literals. For instance, in our 
data set patterns may look like this: 

subject(S), married_at_age(S,A), child_at_age(S,B), B>A.

subject(S), child_at_age(S,A), child_at_age(S,B), B>A.

The first pattern describes a person S who married and subsequently had 
a child; the second one describes a person who had at least two children (with 
different ages). 

Each pattern is associated with a so-called key: the frequency of a pattern is
determined by counting the number of key items (i.e., the number of
instantiations of the key variable) that satisfy the constraints stated in the pattern. In
the above example, it is natural to have the subject S as the key.
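
The role of the key can be illustrated with a small Python sketch. It mimics the first pattern above on a hypothetical in-memory representation (a dict of age lists per subject); it is not how Warmr evaluates queries, and all names are illustrative.

```python
def married_then_child(subject):
    """True if some marriage age A and child age B with B > A exist."""
    return any(b > a
               for a in subject["married_at_age"]
               for b in subject["child_at_age"])


def frequency(subjects, pattern):
    """Count key instances (subjects) that satisfy the pattern.

    A subject counts once, however many variable bindings match.
    """
    return sum(1 for s in subjects.values() if pattern(s))
```

A subject with one marriage and two later children satisfies the pattern through two different bindings of B, but contributes 1 to the frequency.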

Warmr can be considered a first-order upgrade of Apriori P; it performs a 
top-down level-wise search, starting with the key and refining patterns by adding 
literals to them. Infrequent patterns (i.e. patterns of which the frequency is below 
some predefined threshold) are pruned as are their refinements. We refer to jO] 
for more details. 
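
For readers unfamiliar with level-wise search, the following Python sketch shows the idea on plain item sets. It is a propositional simplification under stated assumptions: Warmr refines first-order queries under a language bias, whereas this sketch only adds one item per level and omits the subset-based candidate pruning of full Apriori. The function name is hypothetical.

```python
def levelwise(transactions, min_count):
    """Return {frozenset: count} for all item sets with count >= min_count.

    transactions: list of sets of items.
    """
    items = sorted({i for t in transactions for i in t})
    frequent = {}
    level = [frozenset()]            # level 0: the empty pattern (the "key")
    while level:
        # Refine every surviving pattern by adding one more item.
        candidates = set()
        for pattern in level:
            for item in items:
                if item not in pattern:
                    candidates.add(pattern | {item})
        level = []
        for cand in candidates:
            count = sum(1 for t in transactions if cand <= t)
            if count >= min_count:   # infrequent patterns are never refined
                frequent[cand] = count
                level.append(cand)
    return frequent
```

Because an infrequent pattern is dropped from `level`, none of its refinements is ever generated from it, which is the pruning step described above.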

3.3 Discovering Temporal Change 

As described above, Warmr finds all patterns that have a frequency above a
user-specified threshold. In a second phase, Warmr can combine these patterns into
first-order association rules; basically, these rules are of the form “if LHS occurs,
then RHS occurs with probability c”. In contrast to conventional, propositional
association rule finders, patterns may be formulated in first-order logic, which
allows conditions to be linked together by sharing the same variables.

In order to limit the number of possible rules, the current version of Warmr
expects the user to provide a list of possible patterns that can occur on
the left-hand side (LHS) or right-hand side (RHS) of a rule. This is done with
Warmr’s classes setting. More specifically, if there is a frequent pattern P and
subsets A and B, such that A ⊂ P is one of the classes and B = P\A, then
Warmr returns the rules A => B and B => A (assuming they fulfill possible
additional criteria, such as achieving a minimal confidence level).

^1 The over-sampling of female individuals was intentional in the family and fertility
survey from which the data originate because fertility is more closely linked to female
life courses.

In our application, we used the classes to separate people into different
cohorts. After some experimentation, we decided to use four different cohorts,
each encompassing a 5-year span. In Warmr, this could be simply encoded using
generation literals as classes:

classes([gen40(_), gen45(_), gen50(_), gen55(_)]).

This tells Warmr that we expect to see rules of the following type:^2

gen40(S) => subject(S), child_at_age(S,A), child_at_age(S,B), B>A.



The rules are ordered according to their interestingness. Interestingness is defined 
in Warmr as the number of standard deviations the observed frequency of a 
pattern in an age group differs from the average frequency in the whole data 
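set under consideration. More precisely, for a rule A => B, interestingness
d(A => B) is defined as

\[
d(A \Rightarrow B)
  = \frac{p(B \mid A) - p(B)}{\sqrt{\dfrac{p(B)\,(1 - p(B))}{n(A)}}}
  = \frac{\dfrac{n(A \wedge B)}{n(A)} - \dfrac{n(B)}{N}}
         {\sqrt{\dfrac{\dfrac{n(B)}{N}\left(1 - \dfrac{n(B)}{N}\right)}{n(A)}}}
\]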



where p(B) = n(B)/N is the probability/relative frequency of the pattern B in the
entire data set of N individuals, whereas p(B|A) = n(A ∧ B)/n(A) is the
probability/relative frequency of the same pattern occurring in the subgroup
defined by the class A. Hence,
the numerator computes the deviation of the expected frequency of pattern B
(if it were independent of A) from its actual frequency of occurrence in the
subgroup defined by pattern A. This difference is normalized with the standard
deviation that could be expected if p(B) were the true frequency of occurrence
of B within A. Thus, d(A => B) computes the number of standard deviations
that the actual frequency is away from the expected frequency. In our case,
the interestingness measure compares the observed number of individuals that
satisfy a certain pattern in a cohort to the expected number of individuals that
should satisfy this pattern if the occurrence of the pattern were independent
of the birth cohort.
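
As a worked illustration of the measure, the following Python sketch computes d(A => B) directly from the four counts. The function and parameter names are hypothetical, the numbers in the usage note are invented for the example, and the sketch assumes 0 < n(B) < N and n(A) > 0.

```python
import math


def interestingness(n_a, n_b, n_ab, n_total):
    """Number of standard deviations by which p(B|A) differs from p(B).

    n_a, n_b, n_ab: counts of A, B, and A-and-B; n_total: data set size N.
    """
    p_b = n_b / n_total                       # p(B)
    p_b_given_a = n_ab / n_a                  # p(B|A)
    std = math.sqrt(p_b * (1 - p_b) / n_a)    # expected std. dev. within A
    return (p_b_given_a - p_b) / std
```

With N = 1000, n(B) = 500, n(A) = 100 and n(A ∧ B) = 70, p(B) = 0.5, p(B|A) = 0.7 and the standard deviation is 0.05, so the rule is about 4 standard deviations above expectation; with n(A ∧ B) = 50 the deviation is 0.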



3.4 Filtering of Semantic Redundancy 

Many of the discovered rules are syntactically different but semantically
equivalent, because redundant conditions are added to a rule (if, e.g., a constraint that
specifies that a person has more than 3 children is followed by redundant tests for
having more than 2 or 1 children) or the same situation is expressed in different
ways (e.g., the above-mentioned constraint on the number of children can also be
formulated using an equivalent number of child_at_age/2 predicates together
with inequalities that ensure that they refer to successive events). Some (but not
all) of these cases can be addressed by Warmr’s configurable language bias (e.g.,
by specifying that only one constraint on the number of children is admissible in
a rule) and its constraint specification language (e.g., by specifying that literal
p(X) must not be added if literal q(X) already occurs). We will return to this
issue in Sect. 5.

^2 As mentioned above, Warmr will also produce rules of the form subject(S),
child_at_age(S,A) ... => gen40(S), because the evaluation of rules A => B and
B => A may differ. However, as characteristic rules of the form gen40(S) =>
pattern are more natural to interpret, we applied a filter to Warmr’s output that
only retained this kind of rule.

Fig. 3. Number of patterns and rules per level: the number of found patterns,
the number of found rules and the number of semantically unique rules that
remain after filtering. The scale on the y-axis is logarithmic.

To reduce the number of rules, we employ a simple filtering strategy: for rules
that share the same frequencies for all their components (n(A), n(B), n(A ∧ B))
and hence have the same measure of interestingness d(A => B), we simply
assume that they are semantically equivalent. In such cases we automatically
removed all rules except those that were found at the minimum level (i.e., all
but the shortest rules).^3

Figure 3 shows the number of frequent patterns found in one of our
experiments, the number of rules generated from these patterns, and the number of
rules that survived the filtering process. What also becomes apparent is that the
number of irrelevant and redundant patterns increases with the level. At level
10, where the largest number of frequent patterns is discovered (1795), only four
rules survive the filtering, and at subsequent levels none remain (0 frequencies
are shown as 1 in the log-scale).
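
The filtering strategy can be sketched in a few lines of Python. The rule representation (a dict with the keys shown) and the function name are hypothetical; as in the text, rules with identical counts are assumed to be semantically equivalent, and ties at the same level are resolved arbitrarily (here: the first rule seen is kept).

```python
def filter_rules(rules):
    """Keep one rule per (n(A), n(B), n(A and B)) signature.

    rules: list of dicts with keys 'text', 'level', 'n_a', 'n_b', 'n_ab'.
    Among rules with the same signature, the one found at the
    lowest level (the shortest rule) survives.
    """
    best = {}
    for rule in rules:
        signature = (rule["n_a"], rule["n_b"], rule["n_ab"])
        kept = best.get(signature)
        if kept is None or rule["level"] < kept["level"]:
            best[signature] = rule
    return list(best.values())
```

Comparing the actual sets of covered instances instead of their sizes would be a stricter test of equivalence, at the cost of storing those sets for every rule.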

3.5 Visualizing the Results 

Rules reported by Warmr to be interesting (in the sense that they describe
patterns whose frequency in a certain age group deviates significantly from its
average frequency in the entire population) were inspected manually, looking at
the most interesting rules first. The temporal change of interesting patterns was
visualized by plotting the frequency of occurrence of the pattern over 1-year
cohorts from 1940 to 1960. This was very useful for assessing whether the
anomaly found for one age group represented a trend throughout the years or only
a temporary change in behavior. This phase could be enriched using a query
language for trend shapes, such as the one proposed by ng.

^3 A better approximation of semantic equivalence would be to consider the actual
sets of covered instances, instead of just their size. However, after inspecting a
number of rules with the same interestingness measure, we found no examples where
such rules were not semantically equivalent, so we did
not consider it worthwhile to implement this more accurate approximation.
Concerning the choice of the simplest rule, when several rules had the same interestingness
measure and the same complexity, we arbitrarily chose one of them.

Fig. 4. The trends behind a very simple pattern and a very complex pattern. The
left graph shows the negative trend in people that started their first union when they
married. The pattern behind the right graph is explained in the text.
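
The per-cohort frequencies that are plotted in this step can be computed as in the following Python sketch. The input layout (pairs of birth year and record) and the function name are assumptions made for illustration; any plotting library could then draw the resulting series.

```python
from collections import defaultdict


def frequency_by_birth_year(subjects, pattern):
    """Relative frequency of a pattern within each 1-year birth cohort.

    subjects: iterable of (birth_year, record) pairs (hypothetical layout).
    pattern: function mapping a record to True or False.
    """
    total = defaultdict(int)
    hits = defaultdict(int)
    for year, record in subjects:
        total[year] += 1
        if pattern(record):
            hits[year] += 1
    return {year: hits[year] / total[year] for year in sorted(total)}
```

Plotting the returned values against the birth year shows whether a deviation reported for one 5-year class is part of a steady trend or an isolated fluctuation.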



4 Selected Results 

In this section, we report on some of the results obtained in our case study. 
For these experiments we used the Warmr algorithm as implemented in the 
data mining tool ACE-ilProlog 0 , version 1.1.6. We used the default settings 
for Warmr, except for a minimal support of 0.01 (the default is 0.1). No minimal 
confidence was specified for the generated rules. 

The used background knowledge allowed the system to consider the events 
of child birth, start of a marriage and start of a union, and to order these events 
using inequalities, as well as test whether events occurred at the same time. To 
prevent infinite chains, the latter tests were restricted to ages originating from 
different event types (e.g., the system was not allowed to test the equality of two 
marriage dates), and only one equality test per pair of event types was allowed. 
In addition, all rules were initialized with the date of the first union. With this 
language, a Warmr run that went 16 levels deep took approximately 6 hours. 

Figure 4 shows two discovered trends, one originating from a very simple rule,
and the second from a fairly complex rule. The first pattern is the relation that
people formed their first union when they married. This pattern was found with a
negative deviation of 8.07 for the cohort 1956-1960, and with a positive deviation
of 7.05 for the cohort 1941-1945. Together with its counterpart (Fig. 2), this
pattern showed the strongest deviation of all discovered rules. Near these two
rules were several similar rules that add minor restrictions to them (e.g.,
marrying at the time of the first union and having a child at a later time).

Fig. 5. People that have a child between their first union and a marriage over all people
(left) and the same group over those people that had a marriage after their first union.

The second rule is shown below; it was found at level 7, with negative devi- 
ation of 2.17. Only a few rules at levels > 8 survived the filtering process and 
none of them had a deviation of more than 2 standard deviations. 

gen55(A) => [subject(A), first_union_at_age(A,B),
    married_at_age(A,C), C>B, child_at_age(A,D), D>C,
    child_at_age(A,E), E>B,
    married_at_age(A,F), F>E, child_at_age(A,G), G>E]

The first part of this rule states that persons satisfying this pattern married 
after their first union and had a child thereafter. The second part specifies that 
they had a child after their first union and after this child, they had both a 
marriage and a second child. Note that not all of the events in this rule are 
strictly ordered, and it is sometimes left open whether they refer to identical or 
different events. For example, both marriages referred to in the rule could be 
bound to the same marriage, in which case the pattern describes women that 
had at least two children, at least one of them before the marriage. However, 
all 6 event predicates could also refer to different events. The partial ordering 
of these events makes these patterns more general and may result in interesting 
combinations of subgroups. 

More complicated rules have to be interpreted with caution, as can be seen
from Fig. 5. Its left graph shows the trend for people that have a child between
their first union and a marriage as discovered by the system, but the second
graph, which normalizes this number over the number of people that had their
first union before they married, shows that this trend was mostly due to the trend
in the normalizing pattern.

5 Lessons Learnt 

An interesting, somewhat unexpected result of our explorations was that adding
domain knowledge not only slows down the system but also does not necessarily
yield more interesting results. The reason for this is that frequently the same
relationships were also found in the form of conjunctions of the operational
predicates that were used to define the high-level predicates in the first place. So
while we started our exploration by adding high-level predicates that appeared
to be interesting to us (such as child_out_of_wedlock_at_age/2), we
eventually ended up removing most of them because they unnecessarily
slowed down the system’s performance.^4 Warmr currently allows the formulation of
simple, syntax-oriented constraints (such as “do not add literal p(X) if the
literal q(X) already occurs”), but this is insufficient for our application, where,
e.g., the transitivity of < causes many dependencies that cannot be avoided
using the current constraint specification language. A more powerful language for
expressing semantic constraints is clearly a promising topic for further research.

In particular for higher values of the frequency threshold, we
encountered the problem that negative deviations are often missed by the system. The
reason for this is that patterns that occur significantly less frequently than
expected are often below the frequency threshold. This is not a big problem in our
application because a negative deviation for, e.g., the early cohorts 1941-1945 is
typically paired with a positive deviation for the later cohorts 1956-1960, which
is easily detected by the system. As we visually inspect the found patterns over
all cohorts, it does not matter which deviation is found by the system.

In general, however, these issues may hint at an important shortcoming: in
domains where frequency thresholds cannot be applied for effectively reducing
the number of candidates (or can only reduce them at the expense of losing
interesting patterns), the level-wise search pioneered by Apriori might in fact
not be a suitable choice. In particular, if the search is performed in memory,
alternative search strategies might be considered because of their more
flexible pruning strategies. It is an open research problem which strategy is
more appropriate for multi-relational data mining systems like Warmr.



6 Related Work 

The problem of detecting temporal change in frequent patterns by grouping 
objects over time was also studied by 0). In their application, they discovered 
trends in student admissions to UCI in the years 1993-1998. Their approach 
was limited to detecting changes in propositional patterns (which they called 
contrast sets), while we consider change in temporal patterns. 

The problem of discovering frequent episodes in event sequences could also be
solved by other techniques. These could then be post-processed with similar
techniques for detecting deviations over time. In fact, H2] discuss such a
two-phase solution to the problem of discovering trends in text databases. The first
phase consists of discovering frequent phrases in documents, i.e., in sequences of
words, using the advanced sequence discovery algorithm described in m. In the
second phase, the frequencies of the phrases over given time groups are
determined, and their shape can be queried using the query language described in |21,
which could, in principle, be replaced by our technique of detecting significant
deviations from the mean over different periods of time.

^4 The problem occurs less frequently in propositional tasks, but there too it is
recognized that if such dependencies do exist, one needs special techniques to handle
them 13 .

The above-mentioned approaches basically treat the problem as a basket
analysis problem with time as a special, designated variable that allows multiple
baskets to be integrated into single rules, as long as certain temporal constraints
(e.g., a maximum window size) are followed. Our approach, however, is more
general in the sense that it does not give a special role to time. Temporal
relations are represented in the same way as other domain-dependent knowledge
(even though this was not the main focus of this particular application). As a
consequence, we are not only searching for sequential patterns (i.e., strict
temporal orders of the form A<B<C), but for more general, graph-like structures (such
as A<B and A<C).

Several people have recognized the importance of taking dependencies into 
account when searching for association rules, and proposed solutions, e.g., in the 
form of itemset closures [Zl; some of that work is currently being extended to 
first-order association rules [De Raedt; personal communication], but we have 
no knowledge of existing publications in this area. 



7 Conclusions 

In this paper, we demonstrated a way of exploiting the generality of Warmr, a
multi-relational data mining system, for the task of discovering temporal changes 
in sequential patterns. The generality of Warmr allows a straight-forward encod- 
ing of time-dependent information and a seamless integration with additional 
background knowledge, which facilitates ease of experimentation and flexibility 
in incorporating new knowledge. In particular, during our experimentation, we 
frequently changed the system’s view of the data, e.g., by censoring recent co- 
horts and late events in people’s life courses. Such changes could be handled by 
simple changes in the background knowledge, while the underlying data repre- 
sentation could remain the same. 

On the other hand, our current experiments clearly show the need for
handling dependencies in first-order association rule mining in general. A particularly
interesting consequence of this is that the use of background knowledge may hurt
the performance of systems such as Warmr, rather than improving it. This is an
important issue, as the possibility to use background knowledge is one of the
advantages of ILP approaches. In the current version of Warmr, dependencies can
to some extent be handled using syntactical constraints on clauses, but this is
clearly insufficient to handle semantic dependencies between background
knowledge definitions (e.g., the transitivity of <). Further research is needed to address
such problems in first-order association rule mining.

Abstract. The biological sciences are undergoing an explosion in the
amount of available data. New data analysis methods are needed to deal
with the data. We present work using KDD to analyse data from mutant
phenotype growth experiments with the yeast S. cerevisiae to predict
novel gene functions. The analysis of the data presented a number of
challenges: multi-class labels, a large number of sparsely populated classes,
the need to learn a set of accurate rules (not a complete classification),
and a very large amount of missing values. We developed resampling
strategies and modified the algorithm C4.5 to deal with these problems.
Rules were learnt which are accurate and biologically meaningful. The
rules predict function of 83 putative genes of currently unknown function
at an estimated accuracy of > 80%.



1 Introduction 



The biological sciences are undergoing an unprecedented increase in the amount
of available data. In the last few years the complete genomes of ~30 microbes
have been sequenced, as well as those of “the worm” (C. elegans) and “the fly” (D.
melanogaster). The last few months have seen the sequencing of the first plant
genome Arabidopsis Eg, and the greatest prize of all, the human genome EOg.
In addition to data from sequencing, new post genomic technologies are enabling 
the large-scale and parallel interrogation of cell states under different stages of 
development and under particular environmental conditions, generating very 
large databases. Such analyses may be carried out at the level of mRNA using 
micro-arrays (e.g. jSE]) (the transcriptome) . Similar analyses may be carried out 
at the level of the protein to define the proteome (e.g. | 2 ]), or at the level of small 
molecules, the metabolome (e.g. 123). This data is replete with undiscovered 
biological knowledge which holds the promise of revolutionising biotechnology 
and medicine. KDD techniques are well suited to extracting this knowledge. 
Currently most KDD analysis of bioinformatic data has been based on using 
unsupervised methods e.g. |SI17I32I| . but some has been based on supervised 
methods usini. New KDD methods are constantly required to meet the new 
challenges presented by new forms of bioinformatic data. 

Perhaps the least analysed form of genomics data is that from phenotype 
experiments. In these experiments specific genes are removed from the 
cells to form mutant strains, and these mutant strains are grown under different 
conditions with the aim of finding growth conditions where the mutant and the 
wild type (no mutation) differ (“a phenotype”). This approach is analogous to 
removing components from a car and then attempting to drive the car under 
different conditions to diagnose the role of the missing component. 

In this paper we have developed KDD techniques to analyse phenotype 
experiment data. We wish to learn rules that, given a particular set of phenotype 
experimental results, predict the functional class of the gene mutated. This is an 
important biological problem because, even in yeast, one of the best characterised 
organisms, the function of 30-40% of its genes is still unknown. 

Phenotype experiment data presents a number of challenges to standard data 
analysis methods: the functional classes for genes exist in a hierarchy, a gene may 
have more than one functional class, and we wish to learn a set of accurate rules 
- not necessarily a complete classification. The recognition of functional class 
hierarchies has been one of the most important recent advances in bioinformatics. 
For example, in the Munich Information Center for Protein Sequences 
(MIPS) hierarchy (http://mips.gsf.de/proj/yeast/catalogues/funcat/) the top 
level of the hierarchy has classes such as: “Metabolism” , “Energy” , “Transcrip- 
tion” and “Protein Synthesis” . Each of these classes is then subdivided into more 
specific classes, and these are in turn subdivided, and then again subdivided, so 
the hierarchy is up to 4 levels deep. An example of a subclass of “Metabolism” 
is “amino-acid metabolism” , and an example of a subclass of this is “amino-acid 
biosynthesis”. An example of a gene in this subclass is YPR145w (gene name 
ASN1, product “asparagine synthetase”). In neither machine learning nor statistics 
has much work been done on classification problems where there is a 
class hierarchy. However, such problems are relatively common in the real world, 
particularly in text classification. We deal with the class hierarchy by 
learning separate classifiers for each level. This simple approach has the unfor- 
tunate side-effect of fragmenting the class structure and producing many classes 
with few members - e.g. there are 99 potential classes represented in the data 
for level 2 in the hierarchy. We have therefore developed a resampling method 
to deal with the problem of learning rules from sparse data and few examples 
per class. 

Perhaps an even greater difficulty with the data is that genes may have more 
than one functional class. This is reflected in the MIPS classification scheme 
(where a single gene can belong to up to 10 different functional classes). This 
means that the classification problem is a multi-label one (as opposed to multi- 
class which usually refers to simply having more than two possible disjoint classes 
for the classifier to learn). There is only a limited literature on such problems. 
The UCI repository currently contains just one dataset 
(“University”) that can be considered a multi-label problem. (This dataset shows 
the academic emphasis of individual universities, which can be multi-valued, for 
example, business-education, engineering, accounting and fine-arts.) The simplest 
approach to the multi-label problem is to learn separate classifiers for each 
class (with all genes not belonging to a specific class used as negative examples 
for that class). However, this is clearly cumbersome and time-consuming when 
there are many classes, as is the case in the functional hierarchy for yeast. Also, 
in sparsely populated classes there would be very few positive examples of a 
class and overwhelmingly many negative examples. We have therefore developed 
a new algorithm based on the successful decision tree algorithm C4.5. 

A third challenge in prediction of gene function from phenotype data is that 
we wish to learn a set of rules which accurately predict functional class. This dif- 
fers from the standard statistical and machine learning supervised learning task 
of maximising the prediction accuracy on the test set. The problem resembles in 
some respects association rule learning in data mining. 

In summary our aim is to discover new biological knowledge about: 

— the biological functions of genes whose functions are currently unknown 

— the different discriminatory power of the various growth conditions under 
which the phenotype experiments are carried out 

For this we have developed a specific machine learning method which handles 
the problems provided by this data: 

— many classes 

— multiple class labels per gene 

— the need to know accuracies of individual rules rather than the ruleset as a 
whole 

2 Experimental Method 

2.1 Data 

We used three separate sources of phenotypic data: TRIPLES, EUROFAN 
and MIPS. 

— The TRIPLES (TRansposon-Insertion Phenotypes, Localization and Ex- 
pression in Saccharomyces) data was generated by randomly inserting trans- 
posons into the yeast genome. 

URLs: http://ygac.med.yale.edu/triples/triples.htm, (raw data) 
http://bioinfo.mbb.yale.edu/genome/phenotypes/ (processed data) 

— EUROFAN (European functional analysis network) is a large European network 
of research which has created a library of deletion mutants by using 
PCR-mediated gene replacement (replacing specific genes with a marker gene 
(kanMX)). We used data from EUROFAN 1. 

URL: http://mips.gsf.de/proj/eurofan/ 

— The MIPS (Munich Information Center for Protein Sequences) database 
contains a catalogue of yeast phenotype data. 

URL: http://mips.gsf.de/proj/yeast/ 

The data from the three sources were concatenated together to form a unified 
dataset, which can be seen at http://users.aber.ac.uk/ajc99/phenotype/. The 
phenotype data has the form of attribute-value vectors, with the attributes being 
the growth media, the values of the attributes being the observed sensitivity or 
resistance of the mutant compared with the wild type, and the class being the 
functional class of the gene. Note that this data is not available for all genes, 
because some mutants are inviable or untested, and not all growth media were 
tested/recorded for every gene, so there are very many missing values in the 
data. 

The values that the attributes could take were the following: 

n  no data 
w  wild-type (no phenotypic effect) 
s  sensitive (less growth than for the wild-type) 
r  resistant (better growth than for the wild-type) 

There were 69 attributes, 68 of which were the various growth media (e.g. 
calcofluor_white, caffeine, sorbitol, benomyl), and one of which was a discretised 
count of how many of the media this mutant had shown a reaction to (i.e. for 
how many of the attributes this mutant had a value of “s” or “r”). 



2.2 Algorithm 



The machine learning algorithm we chose to adapt for the analysis of phenotype 
data was the well-known decision tree algorithm C4.5. C4.5 is known to 
be robust and efficient. The output of C4.5 is a decision tree, or equivalently 
a set of symbolic rules. The use of symbolic rules allows the output to be 
interpreted and compared with existing biological knowledge - this is not gener- 
ally the case with other machine learning methods, such as neural networks, or 
support vector machines. 

In C4.5 the tree is constructed top down. For each node the attribute is 
chosen which best classifies the remaining training examples. This is decided 
by considering the information gain, the difference between the entropy of the 
whole set of remaining training examples and the weighted sum of the entropy 
of the subsets caused by partitioning on the values of that attribute. 



$$\mathrm{information\_gain}(S, A) = \mathrm{entropy}(S) - \sum_{v \in A} \frac{|S_v|}{|S|} \, \mathrm{entropy}(S_v)$$

where A is the attribute being considered, S is the set of training examples being 
considered, and $S_v$ is the subset of S with value v for attribute A. The algorithms 
behind C4.5 are well documented and the code is open source, so this allowed 
the algorithm to be extended. 
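
As an illustration of the single-label case described above, the following is a minimal Python sketch of information gain, with each subset weighted by its share of the examples. It is not the C4.5 source code; the function names, the `(attributes, label)` data layout and the toy examples are assumptions made for this sketch.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of single class labels."""
    total = len(labels)
    counts = Counter(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def information_gain(examples, attribute):
    """Entropy of the whole set minus the weighted entropy of the subsets
    obtained by partitioning on `attribute`.
    `examples` is a list of (attribute_dict, label) pairs (hypothetical layout)."""
    labels = [label for _, label in examples]
    subsets = {}
    for attrs, label in examples:
        subsets.setdefault(attrs[attribute], []).append(label)
    remainder = sum(len(s) / len(labels) * entropy(s) for s in subsets.values())
    return entropy(labels) - remainder

# Toy data: attribute values use the codes "w", "s", "r", "n" from Sect. 2.1.
toy = [
    ({"calcofluor_white": "s"}, "cell wall"),
    ({"calcofluor_white": "s"}, "cell wall"),
    ({"calcofluor_white": "w"}, "metabolism"),
    ({"calcofluor_white": "w"}, "metabolism"),
]
print(information_gain(toy, "calcofluor_white"))  # 1.0: the split separates the classes
```

In this toy case the attribute separates the two classes perfectly, so the gain equals the full one bit of entropy in the unsplit set.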

Multiple labels are a problem for C4.5, and almost all other learning methods, 
as they expect each example to be labeled as belonging to just one class. For 
yeast this isn’t the case, as a gene may belong to several different classes. In the 
case of a single class label for each example the entropy for a set of examples is 
just 

$$\mathrm{entropy}(S) = -\sum_{i=1}^{N} p(c_i) \log p(c_i)$$

where $p(c_i)$ is the probability (relative frequency) of class $c_i$ in this set. 

We need to modify this formula for multiple classes. Entropy is a measure of 
the amount of uncertainty in the dataset. It can be thought of as follows: Given 
an item of the dataset, how much information is needed to describe that item? 
This is equivalent to asking how many bits are needed to describe all the classes 
it belongs to. 

To estimate this we sum the number of bits needed to describe membership 
or non-membership of each class (see appendix for intuition). In the general case 
where there are N classes and membership of each class $c_i$ has probability $p(c_i)$, 
the total number of bits needed to describe an example is given by 

$$-\sum_{i=1}^{N} \bigl( p(c_i) \log p(c_i) + q(c_i) \log q(c_i) \bigr)$$

where 

$p(c_i)$ = probability (relative frequency) of class $c_i$ 
$q(c_i) = 1 - p(c_i)$ = probability of not being a member of class $c_i$ 

The new information after a partition on some attribute can then 
be calculated as a weighted sum of the entropy for each subset (calculated as 
above), where this time the weighting counts an item once per class label: if an 
item appears twice in a subset because it belongs to two classes then we count it twice. 
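
The multi-label entropy described above can be sketched in a few lines of Python. This is only an illustration of the formula, not the modified C4.5 code: the function names and the representation of each example as a set of class labels are assumptions, logarithms are taken to base 2, and the label-count weighting used after a partition is not implemented here.

```python
import math

def binary_term(p):
    """Bits needed to describe membership or non-membership of one class
    whose membership probability is p."""
    if p in (0.0, 1.0):
        return 0.0  # membership is certain, so no information is needed
    q = 1.0 - p
    return -(p * math.log2(p) + q * math.log2(q))

def multilabel_entropy(label_sets, classes):
    """Sum of the per-class membership terms over all classes.
    `label_sets` holds one set of class labels per example."""
    n = len(label_sets)
    total = 0.0
    for c in classes:
        p = sum(1 for labels in label_sets if c in labels) / n
        total += binary_term(p)
    return total

# A class that 75% of the examples belong to costs about 0.81 bits.
print(round(binary_term(0.75), 2))  # 0.81

# Four examples, some of them carrying more than one label.
examples = [{"a", "b"}, {"b"}, {"b", "d"}, {"c"}]
print(multilabel_entropy(examples, ["a", "b", "c", "d"]))
```

With a single label per example and two classes this reduces to the ordinary two-class entropy, which is a useful sanity check on any implementation.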

In allowing multiple labels per example we have to allow leaves of the tree 
to potentially be a set of class labels, i.e. the outcome of a classification of an 
example can be a set of classes. When we label the decision tree this needs to be 
taken into account, and also when we prune the tree. When we come to generate 
rules from the decision tree, this can be done in the usual way, except when it is 
the case that a leaf is a set of classes, a separate rule will be generated for each 
class, prior to the rule-pruning part of the C4.5rules algorithm. We could have 
generated rules which simply output a set of classes - it was an arbitrary choice 
to generate separate rules, chosen for comprehensibility of the results. 

2.3 Resampling 

The large number of classes meant that many classes have quite small numbers 
of examples. We were also required only to learn a set of accurate rules, not a 
complete classification. This unusual feature of the data made it necessary for 
us to develop a complicated resampling approach to estimating rule accuracy 
based on the bootstrap. 

All accuracy measurements were made using the m-estimate, which is a 
generalisation of the Laplace estimate, taking into account the a priori 
probability of the class. The m-estimate for rule r, M(r), is: 

$$M(r) = \frac{p + m \frac{P}{P + N}}{p + n + m}$$

where 

P = total number of positive examples, 
N = total number of negative examples, 
p = number of positive examples covered by rule r, 
n = number of negative examples covered by rule r, 
m = parameter to be altered. 

Using this formula, the accuracy for rules with zero coverage will be the a 
priori probability of the class. m is a parameter which can be altered to weight 
the a priori probability. We used m=1. 
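
A small Python sketch of the m-estimate follows, assuming the standard form M(r) = (p + m·P/(P+N)) / (p + n + m), which agrees with the statement that a rule with zero coverage receives the a priori probability of the class. The function name and the example counts are illustrative assumptions, not values from the experiments.

```python
def m_estimate(p, n, P, N, m=1.0):
    """m-estimate of rule accuracy.
    p, n: positive / negative examples covered by the rule
    P, N: total positive / negative examples
    m:    weight given to the a priori class probability"""
    prior = P / (P + N)
    return (p + m * prior) / (p + n + m)

# A rule covering nothing falls back to the class prior.
print(m_estimate(p=0, n=0, P=10, N=90))

# A rule covering 9 positives and 1 negative is pulled slightly below 0.9.
print(m_estimate(p=9, n=1, P=10, N=90))
```

Larger values of m pull the estimate more strongly towards the prior, which matters most for rules that cover only a handful of examples.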

The data set in this case is relatively small. We have 2452 genes with some 
recorded phenotypes, of which 991 are classified by MIPS as “Unclassified” or 
“Classification not yet clear-cut” . These genes of unknown classification cannot 
be used in supervised learning (though we can later make predictions for them) . 
This leaves just 1461 genes, each with many missing values. At the top level of the 
classification hierarchy (the most general classes), there are many examples for 
each class, but as we move to lower, more specific levels, the classes become more 
sparsely populated, and machine learning becomes difficult. 

We aimed to learn rules for predicting functional classes which could be inter- 
preted biologically. To this end we split the data set into 3 parts: training data, 
validation data to select the best rules from (rules were chosen that had an accu- 
racy of at least 50% and correctly covered at least 2 examples), and test data. We 
used the validation data to avoid overfitting rules to the data. However, splitting 
the dataset into 3 parts means that the amount of data available for training 
will be even less. Similarly only a small amount will be available for testing. 
Initial experiments showed that the split of the data substantially affected the 
rulesets produced, sometimes producing many good rules, and sometimes none. 
The two standard methods for estimating accuracy under the circumstance of 
a small data set are 10-fold cross-validation and the bootstrap method. 
Because we are interested in the rules themselves, and not just the accuracy, we 
opted for the bootstrap method, because a 10-fold cross validation would make 
just 10 rulesets, whereas bootstrap sampling can be used to create hundreds of 
samples of the data and hence hundreds of rulesets. We can then examine these 
and see which rules occur regularly and are stable, not just artifacts of the split 
of the data. 

The bootstrap is a method where data is repeatedly sampled with replace- 
ment to make hundreds of training sets. A classifier is constructed for each 
sample, and the accuracies of all the classifiers can be averaged to give a final 
measure of accuracy. First a bootstrap sample was taken from the original data. 
Items of the original data not used in the sample made up the test set. Then a 
new sample was taken with replacement from the sample. This second sample 
was used as training data, and items that were in the first sample but not in the 
second made up the validation set. All three data sets are non-overlapping. 
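
The nested sampling scheme described above can be sketched as follows. This is a hedged illustration rather than the code used in the experiments: the function name, the use of gene identifiers as items and the fixed random seed are all assumptions, and items are assumed to be unique and hashable.

```python
import random

def three_way_split(items, rng):
    """Nested bootstrap split into (train, validation, test).
    The training list may contain repeats because it is drawn with replacement."""
    first = [rng.choice(items) for _ in items]        # bootstrap sample of the data
    first_set = set(first)
    test = [x for x in items if x not in first_set]   # never drawn: test set
    train = [rng.choice(first) for _ in first]        # resample the first sample
    train_set = set(train)
    # In the first sample but not in the second: validation set.
    validation = [x for x in items if x in first_set and x not in train_set]
    return train, validation, test

genes = [f"gene{i}" for i in range(1461)]  # hypothetical identifiers
train, validation, test = three_way_split(genes, random.Random(0))
print(len(set(train)), len(validation), len(test))
```

Repeating the call with different random states gives the many alternative splits from which separate rulesets can be learnt.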

We measured accuracy on the held-out test set. We are aware that this will 
give a pessimistic measure of accuracy (i.e. the true accuracy on the whole data 
set will be higher), but this is acceptable. 






A. Clare and R.D. King 



3 Results 

We attempted to learn rules for all classes in the MIPS functional hierarchy 
http://mips.gsf.de/proj/yeast/catalogues/funcat/, using the catalogue as it was 
on 27 September 1999. 500 bootstrap samples were made, and so C4.5 was run 
500 times and 500 rulesets were generated and tested. To discover which rules 
were stable and reliable we counted how many times each rule appeared across 
the 500 rulesets. Accurate stable rules were produced for many of the classes at 
levels 1 and 2 in the hierarchy. At levels 3 and 4 (the most specific levels with the 
least populated classes) no useful rules were found. That is, at the lower levels, 
few rules were produced and these were not especially general or accurate. 
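
Counting how often each rule recurs across the bootstrap rulesets can be sketched as below. The rule representation (a frozenset of conditions paired with a class name), the function name and the threshold parameter are assumptions for this sketch; rules are treated as identical only when their conditions and class match exactly.

```python
from collections import Counter

def stable_rules(rulesets, min_count):
    """Return {rule: number of rulesets containing it} for rules that
    appear in at least `min_count` rulesets."""
    counts = Counter()
    for ruleset in rulesets:
        counts.update(set(ruleset))  # count a rule at most once per ruleset
    return {rule: c for rule, c in counts.items() if c >= min_count}

# Hypothetical rules: (conditions, predicted class).
r1 = (frozenset({("calcofluor_white", "s"), ("zymolyase", "s")}), "cell wall")
r2 = (frozenset({("hydroxyurea", "s")}), "nuclear organization")
r3 = (frozenset({("caffeine", "r")}), "metabolism")

rulesets = [[r1, r2], [r1], [r1, r3], [r2, r1]]
print(stable_rules(rulesets, min_count=2))
```

Using frozensets for the conditions means that two rules listing the same conditions in a different order are counted as the same rule.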

The good rules are generally very simple, with just one or two conditions 
necessary to discriminate the classes. This was expected, especially since most 
mutants were only sensitive/resistant to a few media. Some classes were far easier 
to recognise than others, for example, many good rules predicted class “CELLU- 
LAR BIOGENESIS” and its subclass “biogenesis of cell wall (cell envelope)” . 

Some examples of the rules and their accuracies follow. The full set of rules 
can be seen at http://users.aber.ac.uk/ajc99/phenotype/ along with the data 
sets used. 

The 4 most frequently appearing rules at level 1 (the most general level in 
the functional catalogue) are all predictors for the class “CELLULAR BIOGENESIS”. 
These rules suggest that sensitivity to zymolyase or papulacandin_b, or 
any reaction (sensitivity or resistance) to calcofluor_white is a general property 
of mutants whose deleted genes belong to the CELLULAR BIOGENESIS class. 
All correct genes matching these rules in fact also belong to the subclass “bio- 
genesis of cell wall (cell envelope)”. The rules are far more accurate than the 
prior probability of that class would suggest should occur by chance. 

These are two of the rules regarding sensitivity/resistance to Calcofluor 
White. 

if the gene is sensitive to calcofluor white and 
   the gene is sensitive to zymolyase 
then its class is "biogenesis of cell wall (cell envelope)" 
   Mean accuracy: 0.909 
   Prior prob of class: 0.095 
   Std dev accuracy: 0.018 
   Mean no. matching genes: 9.3 

if the gene is resistant to calcofluor white 
then its class is "biogenesis of cell wall (cell envelope)" 
   Mean accuracy: 0.438 
   Prior prob of class: 0.095 
   Std dev accuracy: 0.144 
   Mean no. matching genes: 6.7 

These rules confirm that Calcofluor White is useful for detecting cell wall 
mutations. Calcofluor White is a negatively charged fluorescent dye that 
does not enter the cell wall. Its main mode of action is believed to be through 
binding to chitin and prevention of microfibril formation, so weakening the 
cell wall. The explanation for disruption mutations in the cell wall having 
increased sensitivity to Calcofluor White is believed to be that if the cell wall 
is weak, then the cell may not be able to withstand further disturbance. The 
explanation for resistance is less clear, but the disruption mutations may cause 
the dye to bind less well to the cell wall. Zymolyase is also known to interfere 
with cell wall formation. Neither rule predicts the function of any gene of 
currently unassigned function. This is not surprising given the previous 
large-scale analysis of Calcofluor White on mutants. 

Table 1. Number of genes of unknown function predicted 

Level 1 
                        std. deviations from prior 
estimated accuracy         2       3       4 
> 80%                     83      72      35 
> 70%                    209     150      65 
> 50%                    211     150      65 

Level 2 
                        std. deviations from prior 
estimated accuracy         2       3       4 
> 80%                     63      63      63 
> 70%                     77      77      77 
> 50%                    133     126     126 

One rule that does predict a number of genes of unknown function is: 

if the gene is sensitive to hydroxyurea 

then its class is "nuclear organization" 

Mean accuracy: 0.402 

Prior prob of class: 0.215 
Std dev accuracy: 0.066 

Mean no. matching genes: 33.4 

This rule predicts 27 genes of unassigned function. The rule is not of high 
accuracy but it is statistically highly significant. Hydroxyurea is known to inhibit 
DNA replication, so the rule makes biological sense. 

Table 1 shows the number of genes of unassigned function predicted by the 
learnt rules at levels 1 and 2 in the functional hierarchy. These are broken down 
by the estimated accuracy of the predictions and the significance (how 
many standard deviations the estimated accuracy is from the prior probability 
of the class). These figures record genes predicted by rules that have appeared 
at least 5 times during the bootstrap process. 
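
The significance measure can be illustrated with the figures quoted for the rules above. The sketch assumes that the standard deviation in question is the "Std dev accuracy" reported for each rule across the bootstrap samples; the text does not spell this out, so treat it as one plausible reading. The function name is hypothetical.

```python
def deviations_from_prior(mean_accuracy, prior, std_dev):
    """Number of standard deviations separating a rule's mean accuracy
    from the prior probability of its class."""
    return (mean_accuracy - prior) / std_dev

# Figures taken from the rule listings in this section.
print(deviations_from_prior(0.402, 0.215, 0.066))  # hydroxyurea rule
print(deviations_from_prior(0.909, 0.095, 0.018))  # calcofluor white + zymolyase rule
print(deviations_from_prior(0.438, 0.095, 0.144))  # calcofluor white resistance rule
```

On this reading the hydroxyurea rule lies between 2 and 3 standard deviations above its class prior, despite its modest accuracy.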

It can be seen that analysis of the phenotype growth data allows the predic- 
tion of the functional class of many of the genes of currently unassigned function. 

Table 2 shows the number of rules found for the classes at level 1. We did 
not expect to be able to learn rules for every class, as some classes may not be 
distinguishable given the growth media that were used. 

Table 3 shows some general statistics for the rulesets. Due to the nature of 
the bootstrap method of collecting rules, only average accuracy and coverage 
can be computed (rather than total), as the test data set changes with each 
bootstrap sample. 

Table 2. Number of rules that appeared more than 5 times at level 1, broken down 
by class. Classes not shown had no rules (2/0/0/0, 8/0/0/0, 10/0/0/0, 13/0/0/0 and 
90/0/0/0) 

number of rules   class no    class name 
17                1/0/0/0     METABOLISM 
32                3/0/0/0     CELL GROWTH, CELL DIVISION AND DNA SYNTHESIS 
3                 4/0/0/0     TRANSCRIPTION 
1                 5/0/0/0     PROTEIN SYNTHESIS 
2                 6/0/0/0     PROTEIN DESTINATION 
1                 7/0/0/0     TRANSPORT FACILITATION 
21                9/0/0/0     CELLULAR BIOGENESIS (proteins are not localized to the 
                              corresponding organelle) 
5                 11/0/0/0    CELL RESCUE, DEFENSE, CELL DEATH AND AGEING 
77                30/0/0/0    CELLULAR ORGANIZATION (proteins are localized to the 
                              corresponding organelle) 


Table 3. General statistics for rules that appeared more than 5 times. Surprisingly 
high accuracy at level 4 is due to very few level 4 classes, with one dominating class 

          no. rules   no. classes    av rule     average rule 
                      represented    accuracy    coverage (genes) 
level 1   159         9              62%         20 
level 2   74          12             49%         11 
level 3   9           2              25%         18 
level 4   37          1              71%         28 

4 Discussion and Conclusion 

Working with the phenotypic growth data highlighted several learning issues 
which are interesting: 

— We had to extend C4.5 to handle the problem of genes having more than 
one function, the multi-label problem. 

— We needed to select rules for biological interest rather than predicting all 
examples. This required us to use an unusual rule selection procedure and, 
together with the small size of the data set, led to our choice of the bootstrap 
to give a clearer picture of the rules themselves. 

Biologically important rules were learnt which allow the accurate predic- 
tion of functional class for ~200 genes. We are in the process of experimentally 
testing these predictions. The prediction rules can be easily comprehended and 
compared with existing biological knowledge. The rules are also useful as they 
show future experimenters which media provide the most discrimination between 
functional classes. Many types of growth media are shown to be highly infor- 
mative for identifying the functional class of disruption mutants (e.g. Calcofluor 
White), while others are of little value (e.g. sodium chloride). The nature of the C4.5 
algorithm is always to choose attributes which split the data in the most 
informative way. This knowledge can be used in the next round of phenotypic 
experiments. 

Our work illustrates the value of cross-disciplinary work. Functional genomics 
is enriched by a technique for improved prediction of the functional class of genes, 
and KDD is enriched by provision of new data analysis challenges. 



Acknowledgments. We would like to thank Ugis Sarkans for initial collection 
of the data and Stephen Oliver and Douglas Kell for useful discussions. 

Appendix: Reasoning Behind Multi-class Entropy Formula 

This appendix gives an intuition into the reason for the multi-class entropy 
formula. 

How many bits are needed to describe all the classes an item belongs to? 
For a simple description, we could use a bitstring, 1 bit per class, to represent 
each example. With 4 classes {a,b,c,d}, an example belonging to classes b and d 
could be represented as 0101. But this will usually be more bits than we actually 
need. Suppose every example was a member of class b. In this case we would not 
need the second bit at all, as class b membership is assumed. Or suppose 75% 
of the examples were members of class b. Then we know in advance an example 
is more likely to belong to class b than not to belong. The expected amount of 
information gained by actually knowing whether it belongs or not will be: 

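$$\begin{aligned}
& p(\text{belongs}) \cdot \mathrm{gain}(\text{belongs}) + p(\text{doesn't belong}) \cdot \mathrm{gain}(\text{doesn't belong}) \\
&= 0.75 \cdot (\log 1 - \log 0.75) + 0.25 \cdot (\log 1 - \log 0.25) \\
&= -(0.75 \cdot \log 0.75) - (0.25 \cdot \log 0.25) \\
&= 0.81
\end{aligned}$$

where gain(x) = information gained by knowing x, and logarithms are to base 2. 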

That is, we actually only need 0.81 of a bit to represent the extra information 
we need to know membership or not of class b. Generalising, we can say that 
instead of one bit per class, what we actually need is the total of the extra 
information needed to describe membership or non-membership of each class. 
This sum will be 



$$-\sum_{i=1}^{N} \bigl( p(c_i) \log p(c_i) + q(c_i) \log q(c_i) \bigr)$$

where $p(c_i)$ is the probability of membership of class $c_i$ and $q(c_i)$ is the 
probability of non-membership of class $c_i$. 

Abstract. The problem of extracting all association rules from within a binary 
database is well-known. Existing methods may involve multiple passes of the 
database, and cope badly with densely-packed database records because of the 
combinatorial explosion in the number of sets of attributes for which incidence- 
counts must be computed. We describe here a class of methods we have introduced 
that begin by using a single database pass to perform a partial computation of the 
totals required, storing these in the form of a set enumeration tree, which is created 
in time linear to the size of the database. Algorithms for using this structure to 
complete the count summations are discussed, and a method is described, derived 
from the well-known Apriori algorithm. Results are presented demonstrating the 
performance advantage to be gained from the use of this approach. 

Keywords: Association Rules, Set Enumeration Tree, Data Structures. 



1 Introduction 

A well-established approach to Knowledge Discovery in Databases (KDD) involves 
the identification of association rules within a database. An association rule is a 
probabilistic relationship, of the form $A \rightarrow B$, between sets of database attributes, which 
is inferred empirically from examination of records in the database. In the simplest case, 
the attributes are boolean, and the database takes the form of a set of records each of which 
reports the presence or absence of each of the attributes in that record. The paradigmatic 
example is in supermarket shopping-basket analysis. In this case, each record in the 
database is a representation of a single shopping transaction, recording the set of all 
items purchased in that transaction. The discovery of an association rule, $PQR \rightarrow XY$, 
for example, is equivalent to an assertion that “shoppers who purchase items P, Q and R 
are also likely to purchase items X and Y at the same time”. This kind of relationship is 
potentially of considerable interest for marketing and planning purposes. 

More generally, assume a set I of n boolean attributes, $\{a_1, \ldots, a_n\}$, and a database 
table each record of which contains some subset of these attributes, which may equivalently 
be recorded as an n-bit vector reporting the presence or absence of each attribute. 
An association rule R is of the form $A \rightarrow B$, where A, B are disjoint subsets of the 
attribute set I. The support for the rule R is the number of database records which contain 
$A \cup B$ (often expressed as a proportion of the total number of records). The confidence 
in the rule R is the ratio of the support for R to the support for its antecedent, A. A rule 
is described as “frequent” or “interesting”, if it exceeds some defined levels of support 
and confidence. The fundamental problem in association rule mining is the search for 
sets which exceed the support threshold: once these frequent sets have been identified 
the confidence can be immediately computed. 
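
The definitions of support and confidence above can be made concrete with a short Python sketch. The record representation (one set of item names per transaction), the function names and the toy baskets are assumptions made for illustration; support is returned as a raw count rather than a proportion.

```python
def support(records, itemset):
    """Number of records that contain every attribute in `itemset`."""
    itemset = frozenset(itemset)
    return sum(1 for record in records if itemset <= record)

def confidence(records, antecedent, consequent):
    """Support of (antecedent union consequent) divided by support of the
    antecedent. Assumes the antecedent occurs in at least one record."""
    a = frozenset(antecedent)
    both = a | frozenset(consequent)
    return support(records, both) / support(records, a)

# Toy shopping baskets using the item names from the text.
baskets = [
    {"P", "Q", "R", "X", "Y"},
    {"P", "Q", "R", "X", "Y"},
    {"P", "Q", "R"},
    {"X"},
]
print(support(baskets, {"P", "Q", "R"}))
print(confidence(baskets, {"P", "Q", "R"}, {"X", "Y"}))
```

Here the rule PQR→XY holds in two of the three baskets that contain P, Q and R, so its confidence is 2/3.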

In this paper we describe a class of methods for identifying frequent sets of attributes 
within a database. For the databases in which we are interested, the number of attributes 
is likely to be 500 or more, making examination of all subsets computationally infeasible. 
Our methods use a single pass of the database to perform a partial summation of support 
totals, with time and space requirements that are linear to the number of database records. 
The partial counts are stored in a set-enumeration tree structure (the ‘P-tree’) which 
facilitates efficient completion of the final totals required. We describe an algorithm for 
performing this computation, using a second tree structure (the ‘T-tree’) to store the 
support-counts. Results are presented which illustrate the performance gain achieved by 
this approach. 



2 Background 

The central problem in deriving association rules is the exponential time- and 
space-complexity of the task of computing support counts for all $2^n$ subsets of the attribute 
set I. Hence, practicable algorithms in general attempt to reduce the search space by 
computing support-counts only for those subsets which are identified as potentially 
interesting. The best-known algorithm, “Apriori”, does this by repeated passes of the 
database, successively computing support-counts for single attributes, pairs, triples, and 
so on. Since any set of attributes can be “interesting” only if all its subsets also reach 
the required support threshold, the candidate set of sets of attributes is pruned on each 
pass to eliminate those that do not satisfy this requirement. Other algorithms, AIS 
and SETM, have the same general form but differ in the way the candidate sets are 
derived. 
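
The level-wise scheme described above can be sketched as follows. This is a simplified illustration of the Apriori idea, not the published implementation (which, as noted below, stores candidates in a hash-tree): the function name, the in-memory list of records and the use of "at least `min_support`" as the threshold test are assumptions of the sketch.

```python
from itertools import combinations

def apriori(records, min_support):
    """Return {frozenset: support count} for every attribute set whose
    support count is at least `min_support`."""
    items = sorted({a for record in records for a in record})
    candidates = [frozenset([a]) for a in items]
    frequent = {}
    k = 1
    while candidates:
        # One pass over the database per level: count each candidate.
        counts = {c: 0 for c in candidates}
        for record in records:
            for c in candidates:
                if c <= record:
                    counts[c] += 1
        level = {c: n for c, n in counts.items() if n >= min_support}
        frequent.update(level)
        # Join frequent k-sets into (k+1)-sets, then prune any candidate
        # that has an infrequent k-subset.
        k += 1
        joined = {a | b for a in level for b in level if len(a | b) == k}
        candidates = [
            c for c in joined
            if all(frozenset(s) in level for s in combinations(c, k - 1))
        ]
    return frequent

records = [{"P", "Q", "R"}, {"P", "Q"}, {"P", "R"}, {"Q"}]
for itemset, count in sorted(apriori(records, 2).items(),
                             key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), count)
```

In the toy run the candidate {P, Q, R} is never counted, because its subset {Q, R} fails the threshold at the previous level; this is the pruning step the text refers to.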

Two aspects of the performance of these algorithms are of concern: the number of 
passes of the database that are required, which will in general be one greater than the 
number of attributes in the largest interesting set, and the size of the candidate sets which 
may be generated, especially in the early cycles of the algorithm. The number of passes 
may be reduced to 2 by strategies which begin by examining subsets of the database, 
or by sampling the database to estimate the likely candidate set. The drawback 
of these methods is that the candidate set derived is necessarily a superset of the actual 
set of interesting sets, so again the search space may become very large, especially with 
densely packed database records. Large candidate-set sizes create a problem both in 
their storage requirement and in the computation required as each database record is 
examined. The implementation described for the Apriori algorithm stores the candidate 
set in a hash-tree, which is searched for each database record in turn to identify candidates 
that are subsets of the set of attributes included in the record being considered. 

The computation involved in dealing with large candidate sets has led researchers to 
look for methods which seek to identify maximal interesting sets without first examining 
all their smaller subsets. Zaki et al. [13] do this by partitioning the search space into 
clusters of associated attributes; however, this approach breaks down if the database 
is too densely-populated for such clusters to be apparent. Bayardo’s Max-Miner [4] 
algorithm also searches for maximal sets, using Rymon’s set enumeration framework 
[10] to order the search space as a tree. Max-Miner reduces the search space by pruning 
the tree to eliminate both supersets of infrequent sets and subsets of frequent sets. In 
a development from Max-Miner, the Dense-Miner algorithm [5] imposes additional 
constraints on the rules being sought to reduce further the search space in these cases. 
These algorithms cope better with dense datasets than the other algorithms described, but 
again require multiple database passes. For databases which can be completely contained 
in main memory, the DepthProject algorithm of [1] also makes use of a set-enumeration 
structure. In this case the tree is used to store frequent sets that are generated in depth-first 
order via recursive projections of the database. However, because of the combinatorial 
explosion in the number of candidates which must be considered, and/or the cost of 
repeated access to the database, no existing algorithm copes fully with large databases 
of densely-packed records. 

In the method we describe here, we also make use of Rymon’s set enumeration tree, 
to store interim support-counts in a form that facilitates completion of the computation 
required. The approach is novel but generic in that it can be used as a basis for 
implementing improved variants of many existing algorithms. 

3 Partial Support and the P-Tree 

The most computationally expensive part of Apriori and related algorithms is the 
identification of subsets of a database record that are members of the candidate set being 
considered; this is especially so for records that include a large number of attributes. 
We avoid this, at least initially, by counting only sets occurring in the database, 
without considering subsets. 

Let $i$ be a subset of the set $I$ (where $I$ is the set of $n$ attributes represented by the 
database). We define $P_i$, the partial support for the set $i$, to be the number of records 
whose contents are identical with the set $i$. Then $T_i$, the total support for the set $i$, can 
be determined as: 

$$T_i = \sum P_j \quad (\forall j,\ j \supseteq i)$$

For a database of $m$ records, the partial supports can, of course, be counted simply in 
a single database pass, to produce $m'$ partial totals, for some $m' \le m$. We use Rymon’s 
set enumeration framework [10] to store these counts in a tree; Figure 1 illustrates 
this for $I = \{A, B, C, D\}$. To avoid the potential exponential scale of this, the tree 
is built dynamically as the database is scanned so as to include only those nodes that 
represent sets actually present as records in the database, plus some additional nodes 
created to maintain tree structure when necessary. The size of this tree, and the cost of 
its construction, are linearly related to $m$ rather than $2^n$. 
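
As a minimal sketch of these two definitions (ignoring the tree storage entirely, and using the hypothetical names `partial_supports` and `total_support`), partial supports can be gathered in one pass and total supports then obtained by summing over supersets:

```python
from collections import Counter

def partial_supports(records):
    """One database pass: count records whose attribute set is exactly each set."""
    return Counter(frozenset(r) for r in records)

def total_support(partial, target):
    """Total support of target: sum of partial supports of all its supersets."""
    target = frozenset(target)
    return sum(n for s, n in partial.items() if target <= s)
```

The number of keys in the result of `partial_supports` is the quantity the text calls m', and it can never exceed the number of records m.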

Taking advantage of the structural relationships between sets of attributes apparent 
from the tree, we also use the construction phase to begin the computation of total 
supports. As each set is located within the tree during the course of the database pass, it 
is computationally inexpensive to augment interim support-counts, $Q_i$, stored for subsets 
which precede it in the tree ordering; thus: 

$$Q_i = \sum P_j \quad (\forall j,\ j \supseteq i,\ j \text{ follows } i \text{ in lexicographic order})$$

Fig. 1. Tree storage of subsets of {A, B, C, D} 



It then becomes possible to compute total support using the equation: 

$$T_i = Q_i + \sum P_j \quad (\forall j,\ j \supset i,\ j \text{ precedes } i \text{ in lexicographic order})$$

The numbers associated with the nodes of Fig. 1 are the interim counts which would 
be stored in the tree arising from a database the records of which comprise exactly one 
instance of each of the 16 possible sets of attributes; thus, for example, $Q(BC) = 2$, 
derived from one instance of BC and one of BCD. Then: 

$$T(BC) = Q(BC) + P(ABC) + P(ABCD) = Q(BC) + Q(ABC)$$
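
The worked example can be checked numerically. The sketch below is an illustration under stated assumptions: sets are written as sorted attribute strings, lexicographic order is taken to be ordinary string comparison on those strings, the empty set is omitted, and the names `interim` and `total` are hypothetical.

```python
from itertools import combinations

ATTRS = "ABCD"

def all_sets(attrs):
    """Every non-empty subset of attrs, written as a sorted string."""
    return ["".join(c) for r in range(1, len(attrs) + 1)
            for c in combinations(attrs, r)]

# One record for each non-empty set, as in the example database for Fig. 1.
P = {s: 1 for s in all_sets(ATTRS)}

def interim(i):
    """Q_i: partial supports of i and of its supersets that follow i."""
    return sum(n for j, n in P.items() if set(i) <= set(j) and j >= i)

def total(i):
    """T_i: Q_i plus partial supports of proper supersets that precede i."""
    return interim(i) + sum(n for j, n in P.items()
                            if set(i) < set(j) and j < i)
```

With these definitions `interim("BC")` is 2 and `total("BC")` is 4, which agrees with Q(BC) + Q(ABC) = 2 + 2.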

We use the term P-tree to refer to this incomplete set-enumeration tree of interim 
support-counts. An algorithm for building the P-tree, counting the interim totals, is 
described in detail in [7]. Because the P-tree contains all the relevant data stored in 
the original database, albeit in a different form, we can in principle apply versions of 
almost any existing algorithm to complete the summation of total supports. Use of the 
P-tree as a surrogate for the original database, however, offers three potential advantages. 
Firstly, when $n$ is small ($2^n \ll m$), then traversing the tree to examine each node will be 
significantly faster than scanning the whole database. Secondly, even for large $n$, if the 
database contains a high degree of duplication ($m' \ll m$) then using the tree will again 
be significantly faster than a full database pass, especially if the duplicated records are 
densely-populated with attributes. Finally, and most generally, the computation required 
in each cycle of the algorithm is greatly reduced because of the partial summation 
already carried out in constructing the tree. For example, in the second pass of Apriori 
(considering pairs of attributes), a record containing $r$ attributes may require the counts 
for each of its $r(r-1)/2$ subset-pairs to be incremented. When examining a node of 
the P-tree, conversely, it is necessary to consider only those subsets not already 
covered by a parent node, which in the best case will be only $r - 1$ subsets. 

To illustrate this, consider the node ABCD in the tree of Fig. 1. The partial total 
for ABCD has already been included in the interim total for ABC, and this will be 
added to the final totals for the subsets of ABC when the latter node is examined. Thus, 
when examining the node ABCD, we need only consider those subsets not covered by 
its parent, i.e. those including the attribute D. The advantage gained from this will be 
greater, of course, the greater the number of attributes in the set being considered. 
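
The saving for the second pass can be made concrete with a small sketch. The function names are hypothetical, and nodes are again written as attribute strings:

```python
from itertools import combinations

def pairs_apriori(record):
    """All pairs whose counts Apriori may update for one record: r(r-1)/2."""
    return list(combinations(sorted(record), 2))

def pairs_ptree(node, parent):
    """Pairs to update for a P-tree node: only those not covered by the parent,
    i.e. pairs containing at least one attribute absent from the parent."""
    new = set(node) - set(parent)
    return [p for p in combinations(sorted(node), 2) if new & set(p)]
```

For the node ABCD with parent ABC, `pairs_apriori("ABCD")` lists 6 pairs, whereas `pairs_ptree("ABCD", "ABC")` lists only AD, BD and CD, the r - 1 = 3 pairs that include D.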

A rather similar structure to our P-tree has been described independently by [8]. This 
structure, the FP-tree, has a different form but quite similar properties to the P-tree, but 
is built in two database passes, the first of which eliminates attributes that fail to reach 
the support threshold, and orders the others by frequency of occurrence. Each node in 
the FP-tree stores a single attribute, so that each path in the tree represents and counts 
one or more records in the database. The FP-tree also includes more structural 
information, including all the nodes representing any one attribute being linked into a list. 
This structure facilitates the implementation of an algorithm, “FP-growth”, which 
successively generates subtrees from the FP-tree corresponding to each frequent attribute, 
to represent all sets in which the attribute is associated with its predecessors in the tree 
ordering. Recursive application of the algorithm generates all frequent sets. The two 
structures, the FP-tree and our P-tree, which have been developed independently and 
contemporaneously, are sufficiently similar to merit a detailed comparison, which we 
discuss in Sect. 5. 
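
For orientation, the two-pass FP-tree construction summarised above might look roughly like the following. This is a simplified sketch written from the description in this section, not the code of [8]; the class `FPNode`, the function `build_fp_tree` and the tie-breaking rule for equally frequent attributes are assumptions.

```python
from collections import Counter, defaultdict

class FPNode:
    """One attribute per node, with a link to the parent and a count."""
    def __init__(self, item, parent):
        self.item, self.parent, self.count = item, parent, 0
        self.children = {}

def build_fp_tree(records, min_support):
    # Pass 1: attribute frequencies; drop attributes below the threshold.
    freq = Counter(a for r in records for a in set(r))
    keep = {a for a, n in freq.items() if n >= min_support}
    root = FPNode(None, None)
    node_links = defaultdict(list)  # all nodes representing each attribute
    # Pass 2: insert each record with attributes in descending frequency.
    for r in records:
        items = sorted((a for a in set(r) if a in keep),
                       key=lambda a: (-freq[a], a))
        node = root
        for a in items:
            if a not in node.children:
                node.children[a] = FPNode(a, node)
                node_links[a].append(node.children[a])
            node = node.children[a]
            node.count += 1
    return root, node_links
```

The `node_links` lists play the role of the per-attribute linked lists mentioned in the text.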

4 Computing Total Supports 

The construction of the P-tree has essentially performed, in a single pass, a reorganisation 
of the relevant data into a structured set of counts of sets of attributes which appear as 
distinct records in the database. For any candidate set T of subsets of I, the calculation of 
total supports can be completed by walking this tree, adding interim supports as required 
according to the formulae above. 

We can also take advantage of the structure of the P-tree to organise the computation 
of total supports efficiently, exploiting the fact that the counts for each set in 
the P-tree already incorporate contributions from their successor-supersets. Figure 2 
illustrates the dual of Fig. 1, in which each subtree includes only supersets of its root 
node which contain an attribute that precedes all those of the root node. We will call this 
the T-tree, representing the target sets for which the total support is to be calculated, 
as opposed to the interim-support P-tree of Fig. 1. Observe that for any node t in the 
T-tree, all the subsets of t which include an attribute i will be located in that segment of 
the tree found between node i and node t in the tree ordering. This allows us to use the 
T-tree as a structure to effect an implementation of an algorithm to sum total supports: 



Algorithm TFP (Compute Total- from Partial- supports) 
for each node j in P-tree do 
begin k = j - parent(j); 
      i = first attribute in k; 
      starting at node i of T-tree do 
      begin if i ⊆ j then add Qj to Ti; 
            if i = j then exit 
            else recurse to child node; 
            proceed to sibling node; 
      end 
end 



Fig. 2. Tree with predecessor-subtrees 

To illustrate the application of the algorithm, consider the node ACD in the tree 
of Fig. 1. TFP first obtains the difference of this node from its parent, AC, i.e. D, and 
begins traversing the T-tree at node D. From this point the count associated with ACD 
will be added to all nodes encountered that are subsets of ACD, i.e. D, AD, CD and 
ACD, the traversal terminating when the node ACD is reached. Note that the count for 
the node BD, which is not a subset of ACD, will not be updated, nor will its subtree be 
traversed. 
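
The update rule that TFP applies can be sketched without building a T-tree at all. The code below is a simplified stand-in under several assumptions: a plain prefix trie with one attribute per node replaces the P-tree (so the difference between a node and its parent is always its last attribute), the qualifying subsets are enumerated directly instead of being located by a T-tree traversal, and the names `interim_counts` and `tfp_totals` are hypothetical.

```python
from collections import Counter
from itertools import combinations

def interim_counts(records):
    """Q[j]: number of records whose sorted attribute string starts with j."""
    q = Counter()
    for r in records:
        s = "".join(sorted(set(r)))
        for end in range(1, len(s) + 1):
            q[s[:end]] += 1
    return q

def tfp_totals(q, targets):
    """Add each Q_j to every target that is a subset of j but not of j's
    parent, i.e. (in this trie) every subset containing j's last attribute."""
    totals = dict.fromkeys(targets, 0)
    for j, qj in q.items():
        rest, last = j[:-1], j[-1]
        for r in range(len(rest) + 1):
            for c in combinations(rest, r):
                t = "".join(c) + last
                if t in totals:
                    totals[t] += qj
    return totals
```

For the node ACD this visits exactly D, AD, CD and ACD, as in the illustration above. The direct enumeration is exponential in the length of a node, which is the cost the T-tree traversal is designed to avoid by visiting only candidates that actually exist.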

Of course, to construct the entire T-tree would imply an exponential storage 
requirement. In any practical method, however, it is only necessary to create that subset of the 
tree corresponding to the current candidate set being considered. Thus, for example, a 
version of the Apriori algorithm using these structures would consider candidates which 
are singletons, pairs of attributes, triples, etc., in successive passes. This algorithm, which 
we will call Apriori-TFP, has the following form: 

1. Build level K in the T-tree. 

2. “Walk” the P-tree, applying algorithm TFP to add interim supports associated with 
individual P-tree nodes to the level K nodes established in (1) . 

3. Remove any level K T-tree nodes that do not have an adequate level of support. 

4. Repeat steps (1), (2) and (3) until a level K is reached where no nodes are adequately 
supported. 
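
The four steps above can be sketched as a loop. As before this is an illustration under assumptions rather than the authors' implementation: a prefix trie of interim counts stands in for the P-tree, each T-tree level is held as a flat dictionary of candidate strings, and all function names are hypothetical.

```python
from collections import Counter
from itertools import combinations

def interim_counts(records):
    """Q[j]: number of records whose sorted attribute string starts with j."""
    q = Counter()
    for r in records:
        s = "".join(sorted(set(r)))
        for end in range(1, len(s) + 1):
            q[s[:end]] += 1
    return q

def count_level(q, level, k):
    """Step 2: add interim counts to the level-k candidates they support."""
    totals = dict.fromkeys(level, 0)
    for j, qj in q.items():
        for c in combinations(j[:-1], k - 1):
            t = "".join(c) + j[-1]
            if t in totals:
                totals[t] += qj
    return totals

def apriori_tfp(records, min_support):
    q = interim_counts(records)  # the single pass over the database
    level = sorted({a for r in records for a in r})  # step 1 with K = 1
    frequent, k = {}, 1
    while level:
        totals = count_level(q, level, k)                      # step 2
        kept = {t: n for t, n in totals.items() if n >= min_support}  # step 3
        frequent.update(kept)
        k += 1                                                 # step 4
        survivors = sorted({a for t in kept for a in t})
        # Step 1 for the next level: keep a candidate only if all of its
        # (k-1)-subsets are still present in the pruned level.
        level = ["".join(c) for c in combinations(survivors, k)
                 if all("".join(s) in kept for s in combinations(c, k - 1))]
    return frequent
```

Only `interim_counts` touches the records; every later level is counted from the stored interim totals.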

The algorithm begins by constructing the top level of the T-tree, containing all the 
singleton subsets, i.e. the single attributes in I. A first pass of algorithm TFP then counts 
supports for each of these in a single traversal of the P-tree. Note again that identification 
of the relevant nodes in the T-tree is trivial and efficient, as these will be located in a 
(usually short) segment of the level-1 list. In practice, it is more efficient to implement 
level 1 of the T-tree as a simple array of attribute-counts, which can be processed more 
quickly than is the case for a list structure. A similar optimisation can be carried into 
level 2, replacing each branch of the tree by an array, and again this is likely to be more 
efficient when most of the level 2 nodes remain in the tree. 

Following completion of the first pass, the level 1 T-tree is pruned to remove all 
nodes that fail to reach the required support threshold, and the second level is generated, 
adding new nodes only if their subsets are contained in the tree built so far, i.e. have been 
found to have the necessary threshold of support. The new level of the tree forms the 
candidate set for the next pass of the algorithm TFP. The complete algorithm is described 
formally in Table 1 (Part 1) and Table 2 (Part 2). This uses a function, endDigits, that 
takes two arguments P and N (where N is the current level) and returns a set comprising 
the last N attributes in the set P; thus endDigits(ABC, 2) = BC. The significance of 
this is that BC is the last subset of ABC at level 2 that need be considered. 
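
A possible rendering of endDigits is shown below. The representation of a set as a sorted attribute string is an assumption, as is the reading that the T-tree orders sets by their last attribute first (so that the ordering key is the reversed string).

```python
from itertools import combinations

def end_digits(p, n):
    """Return the last n attributes of the sorted attribute string p."""
    return p[-n:] if n > 0 else ""

def level_subsets(p, n):
    """All n-attribute subsets of p, in the assumed T-tree order."""
    subsets = ["".join(c) for c in combinations(p, n)]
    return sorted(subsets, key=lambda s: s[::-1])
```

Under that ordering the level-2 subsets of ABC are AB, AC, BC, and the last of them is `end_digits("ABC", 2)`.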
Table 1. Total Support Algorithm (Part 1) 

∀P ∈ Ptree where (numAttributes(P) ≥ requiredLevel) 
    P' = P \ Pparent 
    ∀T1,j (nodes at level 1) 
    loop while P' ≠ null 
        if T1,j < P'  j++ 
        if T1,j = P' 
            if (requiredLevel = 1) Tsup = Tsup + Psup 
            else Part 2 
            P' = null 
        if (T1,j ⊂ P') 
            if (requiredLevel = 1) Tsup = Tsup + Psup 
            else Part 2 
            P' = P' \ firstAttribute(P'); j++ 



Table 2. Total Support Algorithm (Part 2) 

P'' = endDigits(P, currentLevel) 
loop while Ti,j ≠ null 
    if Ti,j < P'' 
        if (Ti,j ⊂ P) 
            if currentLevel = requiredLevel  Tsup = Tsup + Psup 
            else recursively call Part 2 commencing with Pi++,i 
        j++ 
    if Ti,j = P'' 
        if currentLevel = requiredLevel  Tsup = Tsup + Psup 
        else recursively call Part 2 commencing with Pi++,i 
        stop 
    if Ti,j > P'' 
        stop 



5 Results 



To evaluate the algorithms we have described, we have compared their performance with 
that for our implementations of two published methods: the original Apriori algorithm 
(founded on a hash tree data structure), and the FP-growth algorithm described in [8]. 
In both cases, the comparisons are based on our own implementations of the algorithms, 
which follow as closely as we can judge the published descriptions. All the 
implementations, including those for our own algorithms, are experimental prototypes, unoptimised 
low-performance Java programs. 

The first set of experiments illustrates the performance characteristics involved in the 
creation of the P-tree. Figure 3 shows the time to build the P-tree, for databases of 
200,000 records with varying characteristics. The graphs of storage requirements 
have exactly the same pattern. The three cases illustrated represent synthetic databases 
constructed using the QUEST generator described in [3]. This uses parameters T, which 
defines the average number of attributes found in a record, and I, the average size of the 
maximal supported set. Higher values of T and I in relation to the number of attributes 
N correspond to a more densely-populated database. These results show that the cost of 
building the P-tree is almost independent of N, the number of attributes. As is the case 
for all association-rule algorithms, the cost of the P-tree is greater for more 
densely-populated data, but in this case the scaling appears to be linear. 

Figure 4 examines the P-tree storage requirement for databases of 500 attributes, 
with the same sets of parameters, as the number of database records is increased. This 
shows, as predicted, that the size of the tree is linearly related to the database size. Again, 
this is also the case for the construction time. The actual performance figures for the 
P-tree construction could easily be improved by a more efficient implementation, 
and it would also be possible and probably worthwhile to use this first pass to compute 
total support counts for the single attributes. However, the construction of the P-tree 
is essentially a restructuring of the database, the effect of which will be realised in all 
subsequent data mining experiments. 

In Table 3 we examine the cost of building the P-tree in comparison with that for 
the FP-tree of [8]. The figures tabulated are for two different datasets: 

1. quest.T25.I10.N1K.D10K: A synthetic data set, also used in [8], generated using 
the Quest generator (N=1000 attributes, D=10000 records). 

2. fleet.N194.D9000: A genuine data set, not in the public domain, provided by a UK 
insurance company. Note that this set is much denser than quest.T25.I10.N1K.D10K. 



Fig. 3. Graph showing effort (time, in minutes) to generate P-tree for data sets with number of 
rows fixed at 200000 
Fig. 4. Graph showing P-tree storage requirements for data sets with number of columns fixed at 
500 



Table 3. P-tree and FP-tree generation characteristics 

                     quest.T25.I10.N1K.D10K            fleet.N194.D9000 
                     Storage (Bytes)   Time (Mins)     Storage (Bytes)   Time (Mins) 
P-tree                     1,020,690          0.65             582,196          0.36 
FP-tree (Sup 5%)           1,566,838          3.43             767,062          1.24 
FP-tree (Sup 4%)           2,283,360          5.53             912,918          1.39 
FP-tree (Sup 3%)           3,028,082          9.36           1,334,146          2.04 
FP-tree (Sup 2%)           3,974,482         20.00           1,704,990          3.28 
FP-tree (Sup 1%)           4,567,480         34.83           1,754,990          3.35 



With respect to Table 3 it should be noted that the procedure for building the 
FP-tree eliminates all single attributes that fail to reach the support threshold, so figures 
for a range of support thresholds are tabulated against the (constant) characteristics of 
the P-tree. As can be seen, the P-tree is a significantly more compact structure, and 
its construction time lower than that of the FP-tree. The greater size of the FP-tree 
arises from the greater number of nodes it creates, and the additional links required by 
the FP-growth algorithm. The FP-tree stores each attribute of a record as a separate 
node, so that, for example, two records ABCDE and ABCXY, with a common prefix 
ABC, would require in all 7 nodes. The P-tree, conversely, would create only 3 nodes: 
a parent ABC, and child nodes DE and XY. Each node in the FP-tree also requires 
two additional links not included in the P-tree. One, the “node-link”, connects all nodes 
representing the same attribute, and the other, which links a node to its parent, appears 
to be necessary to effect an implementation of FP-growth. The greater construction time 
for the FP-tree is unsurprising, given its more complex structure and that it requires 
two passes of the source data. In these trials, this data is main-memory resident: in the 
case of a dataset too large for this to be possible, the cost of the additional pass would 
of course be much greater. 
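
The node counts in the ABCDE/ABCXY example can be reproduced with a small sketch. It is illustrative only: an uncompressed prefix trie stands in for the FP-tree (attribute order taken as given, with no frequency reordering or pruning), and a path-compressed view of the same trie stands in for the P-tree. The names `build_trie` and `count_nodes` are hypothetical.

```python
def build_trie(records):
    """Prefix trie with one attribute per node; also note where records end."""
    root, ends = {}, set()
    for r in records:
        node, path = root, ""
        for a in r:
            node = node.setdefault(a, {})
            path += a
        ends.add(path)
    return root, ends

def count_nodes(trie, ends, compressed, path=""):
    """Count nodes; if compressed, chains of single-child nodes that are not
    record ends are merged into the node below them."""
    total = 0
    for a, child in trie.items():
        p = path + a
        is_node = (not compressed) or p in ends or len(child) != 1
        total += int(is_node) + count_nodes(child, ends, compressed, p)
    return total
```

For the two records in the text this gives 7 nodes uncompressed and 3 compressed (ABC, DE and XY).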

Finally, to evaluate the performance of the method for computing final support-counts, 
we have compared the Apriori-TFP algorithm we have described with our 
implementations of the original Apriori (founded on a hash tree data structure) and of 
FP-growth. The results are presented in Fig. 5, for quest.T25.I10.N1K.D10K, and Fig. 6, 
for fleet.N194.D9000. In all cases, to give the fairest basis for comparison, we have 
used data which is main-memory resident throughout. The performance times presented 
for Apriori-TFP and FP-growth do not include the time to produce the P-tree 
or FP-tree respectively. 

Fig. 5. Graph showing processing time (mins) to mine (1) the P-tree, (2) the FP-tree and (3) to 
perform the same operation using a traditional Apriori algorithm using quest.T25.I10.N1K.D10K 

Fig. 6. Graph showing processing time (mins) to mine (1) the P-tree, (2) the FP-tree and (3) to 
perform the same operation using a traditional Apriori algorithm using fleet.N194.D9000 

As we would expect, Apriori-TFP strongly outperforms our implementation of Apriori. 
This improvement arises from a combination of two factors. The first, as described 
above, is the lower number of support-count updates that will be required when 
examining a P-tree node, as opposed to the number required in Apriori from examination of 
the records from which the node is made up. This gain will be greatest when there are 
clusters of records including significant numbers of shared attributes (as we might hope 
to find when mining potentially interesting data), and, especially, if there are significant 
numbers of duplicated records. Secondly, the effect is compounded by the more efficient 
localisation of candidates obtained by using the T-tree for the TFP algorithm, as opposed 
to the hash-tree used by Apriori. The cost of accessing the hash-tree to locate candidates 
for updating increases rapidly as the candidate set increases in size, as is the case for 
lower support thresholds, and is greatest when examining a record which includes many 
attributes and hence many potential candidates for updating. 

Apriori-TFP also outperforms our implementation of FP-growth using 
quest.T25.I10.N1K.D10K, although the difference here is much less. We believe that the 
performance gain here is a consequence of the cost of the recursive construction of successive 
conditional FP-trees, which, in our straightforward implementation, is much slower than 
the simple iterative building of the T-tree. In the case of the fleet.N194.D9000 data set, 
similar performance times are recorded for Apriori-TFP and FP-growth, with neither 
consistently outperforming the other. However, if the P-tree/FP-tree 
generation times are included, Apriori-TFP clearly outperforms FP-growth. 

Although it is possible that some of the advantage is an artefact of our 
implementations, the results appear to show that the simpler P-tree structure offers at least as 
good performance as the more complex FP-tree. Moreover, the above experiments use 
memory-resident data only; we believe that the additional structural links in the FP-tree, 
and the need for repeated access to generate subtrees, will create problems for efficient 
implementation in cases for which the tree is too large to hold in main memory. For 
the simpler P-tree structure, conversely, it is easy to describe an efficient construction 
process which will build separate trees for manageable segments of the database, prior 
to a final merging into a single tree. Nor is it necessary, in general, for the P-tree to 
be retained in main memory throughout the calculation of final support totals. The only 
structural information necessarily retained is the relationship of a node to its parent. For 
example, if a node representing the set ABDFG is present in the tree as a child of the 
node ABD, all the relevant information can be recorded by a node representation of 
the form ABD.FG. In this form, the “tree” can in fact be stored finally as a simple 
array in any convenient order, depending on the needs of the algorithm to compute the 
final support totals. In the case of the Apriori-TFP algorithm, the tree/array is processed 
element-by-element in any order, causing no problems even when it is necessary to hold 
it in secondary memory. 
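
A minimal sketch of this flattened representation follows. The textual form `ABD.FG` is taken from the example above; the helper names and the use of attribute strings are assumptions.

```python
from itertools import combinations

def encode(parent, node):
    """Write a node as 'parent.suffix', e.g. ABD and ABDFG -> 'ABD.FG'."""
    if not node.startswith(parent):
        raise ValueError("parent must be a prefix of node")
    return parent + "." + node[len(parent):]

def decode(entry):
    """Split 'parent.suffix' back into (parent, suffix)."""
    parent, suffix = entry.split(".")
    return parent, suffix

def subsets_to_update(entry, k):
    """Level-k subsets of the node that its parent does not already cover."""
    parent, suffix = decode(entry)
    node = parent + suffix
    return ["".join(c) for c in combinations(node, k) if set(c) & set(suffix)]
```

Because each entry carries its own parent, `subsets_to_update` needs no other part of the tree, so the entries can be read from an array or a file in any order.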



6 Conclusions 

We have presented here an algorithm for computing support counts using as a starting 
point an initial, incomplete computation stored as a set-enumeration tree. Although 
the actual algorithm described here to compute the final totals is based on the Apriori 
algorithm, the method itself is generic, in that, once the P-tree has been created, a variety 
of methods may be applied to complete the summation. Many of these methods, like the 
one we have illustrated, will be able to take advantage of the partial computation already 
carried out in the initial database pass to reduce the cost of further multiple passes. 

Note, however, that the advantage gained from this partial computation is not equally 
distributed throughout the set of candidates. For candidates early in the lexicographic 
order, most of the support calculation is completed during the construction of the P-tree; 
for example, for the attributes of Fig. 1, support for the sets A, AB, ABC and ABCD 
will be counted totally in this first stage of the summation. This observation allows us 
to consider methods which maximise the benefit from this by a suitable ordering of the 
attribute set. This is, of course, the heuristic used by [8], and also, in various ways, by 
[6], [4] and [1]. 

We could also increase the proportion of the summation which is completed during 
the initial scan of the database by a partitioning of the P-tree. For example, it would 
be possible to separate the tree of Fig. 1 into four subtrees, rooted at the nodes A, B, 
C and D, and for the first pass to accumulate interim supports within each of these 
subtrees independently. In this case, a record containing the set ABD, for example, 
would increment the support-counts for ABD within the A-tree, BD within the B-tree, 
and D within the (single-node) D-tree. Again, the effect of this is similar to that for 
a set of conditional FP-trees produced by FP-growth. The advantage offered is that it 
provides a means of reducing the size of trees required for processing. The size of the 
complete subtree corresponding to an attribute $a_i$ that is in position $i$ in the tree ordering 
is $2^{n-i}$. However, the P-tree construction method we use will produce an incomplete 
subtree, the size of which will be of order $m'$, where $m' \le T_{a_i}$, the number of records 
in the database which contain $a_i$ (again, reduced by the existence of duplicates). Thus, 
the storage requirement for each subtree is less than or equal to $\min\{2^{n-i}, T_{a_i}\}$. The 
requirement for any single subtree can be minimised by ordering the attributes in reverse 
order of their frequency, so that the most common attributes are clustered at the 
high-order end of the tree structure. 
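
The partitioning can be sketched as follows. The code assumes that the complete subtree rooted at the attribute in position i (counting from 1) holds 2^(n-i) sets, as in Fig. 1, where the subtree rooted at A holds 8 of the 16 sets; the function names are hypothetical.

```python
def partition_record(record):
    """Map each attribute of a record to the set counted in that attribute's
    subtree: ABD -> ABD in the A-tree, BD in the B-tree, D in the D-tree."""
    s = "".join(sorted(set(record)))
    return {s[k]: s[k:] for k in range(len(s))}

def subtree_bound(n, i, records_with_attribute):
    """Upper bound on the size of the subtree for the attribute in position i:
    the smaller of the complete subtree size and the number of records that
    contain the attribute."""
    return min(2 ** (n - i), records_with_attribute)
```

For attributes near the end of the ordering the first term of the bound is small, which is why a complete array of counts becomes feasible there.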

Partitioning the tree in this way would allow us (in one pass) to organise the data 
into sets each of which can be processed independently and may be small enough to 
be retained in central memory. At the high-order end of the organisation, i.e. for values 
of $i$ close to $n$, the $2^{n-i}$ limit becomes computable. Thus, for large $i$, it may be more 
efficient to store partial supports in a complete array of subset-counts, and to use an 
exhaustive algorithm to compute total supports efficiently. Conversely, for smaller i, 
the conservative P-tree storage method, and an algorithm such as Apriori-TFP can be 
applied. We are presently investigating this and other heuristics to produce effective 
hybrid algorithms of this kind. 



References 

1. Agarwal, R., Aggarwal, C. and Prasad, V. Depth First Generation of Long Patterns. Proc ACM 
KDD 2000 Conference, Boston, 108-118, 2000. 

2. Agrawal, R., Imielinski, T. and Swami, A. Mining Association Rules Between Sets of Items in 
Large Databases. SIGMOD-93, 207-216. May 1993. 

3. Agrawal, R. and Srikant, R. Fast Algorithms for Mining Association Rules. Proc 20th VLDB 
Conference, Santiago, 487-499. 1994 

4. Bayardo, R.J. Efficiently Mining Long Patterns from Databases. Proc ACM-SIGMOD Int 
Conf on Management of Data, 85-93, 1998 

5. Bayardo, R.J., Agrawal, R. and Gunopulos, D. Constraint-based rule mining in large, dense 
databases. Proc 15th Int Conf on Data Engineering, 1999 

6. Brin, S., Motwani, R., Ullman, J.D. and Tsur, S. Dynamic itemset counting and implication 
rules for market basket data. Proc ACM SIGMOD Conference, 255-256, 1997 

7. Goulbourne, G., Coenen, F. and Leng, P. Algorithms for Computing Association Rules using 
a Partial-Support Tree. J. Knowledge-Based Systems 13 (2000), 141-149. (also Proc ES’99.) 

8. Han, J., Pei, J. and Yin, Y. Mining Frequent Patterns without Candidate Generation. Proc 
ACM SIGMOD 2000 Conference, 1-12, 2000. 






9. Houtsma, M. and Swami, A. Set-oriented mining of association rules. Research Report RJ 
9567, IBM Almaden Research Centre, San Jose, October 1993. 

10. Rymon, R. Search Through Systematic Set Enumeration. Proc. 3rd Int'l Conf. on Principles 
of Knowledge Representation and Reasoning, 1992, 539-550. 

11. Savasere, A., Omiecinski, E. and Navathe, S. An efficient algorithm for mining association 
rules in large databases. Proc 21st VLDB Conference, Zurich, 432-444. 1995. 

12. Toivonen, H. Sampling large databases for association rules. Proc 22nd VLDB Conference, 
134-145. Bombay, 1996. 

13. Zaki, M.J., Parthasarathy, S., Ogihara, M. and Li, W. New Algorithms for fast discovery of 
association rules. Technical report 651, University of Rochester, Computer Science Department, 
New York. July 1997. 




Gaphyl: A Genetic Algorithms Approach to Cladistics 



Clare Bates Congdon 



Department of Computer Science, Colby College, 5846 Mayflower Hill Drive, 
Waterville, ME 04901, USA 
congdon@colby.edu 
http://www.cs.colby.edu/~congdon 



Abstract. This research investigates the use of genetic algorithms to 
solve problems from cladistics - a technique used by biologists to 
hypothesize the evolutionary relationships between organisms. Since exhaustive 
search is not practical in this domain, typical cladistics software packages 
use heuristic search methods to navigate through the space of possible 
trees in an attempt to find one or more “best” solutions. We have 
developed a system called Gaphyl, which uses the genetic algorithm approach 
as a search technique for finding cladograms, and a tree evaluation 
metric from a common cladistics software package (Phylip). On a nontrivial 
problem (49 species with 61 attributes), Gaphyl is able to find more of 
the best known trees with less computational effort than Phylip is able to 
find (corresponding to more equally plausible evolutionary hypotheses). 



1 Introduction 

The human genome project and similar projects in biology have led to a wealth 
of data and the rapid growth of the emerging field of bioinformatics, a hybrid 
discipline between biology and computer science that uses the tools and techniques 
of computer science to help manage, visualize, and find patterns in this wealth 
of data. The work reported here is an application to biology, and indicates gains 
from using genetic algorithms (GAs) as the search mechanism for the task. 

Cladistics |E] is a method widely used by biologists to reconstruct 
hypothesized evolutionary pathways followed by organisms currently or previously 
inhabiting the Earth. Given a dataset that contains a number of different species 
(also called taxa), each with a number of attribute-values (also called 
character states), cladistics software constructs cladograms (also called phylogenies), 
which are representations of the possible evolutionary relationships between the 
given species. A typical cladogram is a tree structure: The root of a tree can be 
viewed as the common ancestor, the leaves of a tree are the species, and subtrees 
are subsets of species that share a common ancestor. Each branching of a parent 
node into offspring represents a divergence in one or more attribute-values of 
the species within the two subtrees. In an alternate approach, sometimes called 
“unrooted trees” (and sometimes called “networks”), the root of the tree is not 
assumed to be an ancestral state, and drawing the structures as trees seems to 
be primarily a convenience for the software authors. 

Cladograms are evaluated using metrics such as parsimony: A tree with fewer 
evolutionary steps is considered better than one with more evolutionary steps. 
The work reported here used Wagner parsimony, though there are other 
possibilities, such as Camin-Sokal parsimony. (See [10] or for example, for a 
discussion of alternatives.) Wagner parsimony is straightforward to compute 
(requiring only a single pass through the tree) and incorporates few constraints 
on the evolutionary changes that will be considered. (For example, some 
parsimony approaches require the assumption that characters will be added, but 
not lost, through the evolutionary process.) An unrooted tree evaluated with 
Wagner parsimony can be rooted at any point without altering the evaluation 
of the tree. 
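
To make the idea of a single-pass parsimony count concrete, here is a small sketch. It does not reproduce Phylip's evaluation code; it uses a Fitch-style post-order pass, which we take to match the Wagner criterion for binary characters where changes in both directions are allowed. The tree encoding (nested pairs of species names), the character matrix layout and the function names are all assumptions.

```python
def fitch_cost(tree, states):
    """Return (possible ancestral states, number of changes) for one character.
    tree is a species name or a pair (left, right); states maps name -> state."""
    if isinstance(tree, str):
        return {states[tree]}, 0
    left, right = tree
    ls, lc = fitch_cost(left, states)
    rs, rc = fitch_cost(right, states)
    common = ls & rs
    if common:
        return common, lc + rc      # subtrees can agree: no extra change
    return ls | rs, lc + rc + 1     # subtrees disagree: one more change

def parsimony(tree, matrix):
    """Total changes over all characters; matrix maps name -> state string."""
    n_chars = len(next(iter(matrix.values())))
    return sum(fitch_cost(tree, {sp: row[c] for sp, row in matrix.items()})[1]
               for c in range(n_chars))
```

With species A and B coded `00` and C and D coded `11`, the tree `(("A", "B"), ("C", "D"))` needs 2 changes and `(("A", "C"), ("B", "D"))` needs 4, so the first is the more parsimonious.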

The typical cladistics software approach uses a deterministic hillclimbing 
methodology to find a cladogram for a given dataset, saving one or more “most 
parsimonious” trees as the result of the process. (The most parsimonious trees 
are the ones with a minimum number of evolutionary changes connecting the 
species in the tree. Multiple “bests” correspond to equally plausible evolutionary 
hypotheses, and finding more of these competing hypotheses is an important 
part of the task.) The tree-building approach adds each species into the tree in 
sequence, searching for the best place to add the new species. The search process 
is deterministic, but different trees may be found by running the algorithm with 
different random “jumbles” of the order of the species in the dataset. 

The genetic algorithm (GA) approach to problem solving has shown improve- 
ments to hillclimbing approaches on a wide variety of problems EHH. In this 
approach, a population of possible solutions to the problem “breed”, producing
new solutions; over a number of “generations”, the population tends to include 
better solutions to the problem. The process uses random numbers in several 
different places, as will be discussed later. 

This research is an investigation into the utility of using the genetic algorithm 
approach on the problem of finding parsimonious cladograms. 



2 Design Decisions 

To hasten the development of our system, we used parts of two existing software 
packages. Phylip [3] is a cladistics system widely used by biologists. In particular,
this system contains code for evaluating the parsimony of the cladograms (as well 
as some helpful utilities for working with the trees). Using the Phylip source code 
rather than writing our own tree-evaluation modules also helps to ensure that our 
trees are properly comparable to the Phylip trees. Genesis ^ is a GA package 
intended to aid the development and experimentation with variations on the 
GA. In particular, the basic mechanisms for managing populations of solutions 
and the modular design of the code facilitate implementing a GA for a specific 
problem. We named our new system Gaphyl, a reflection of the combination of 
GA and Phylip source code. 

The research described here was conducted using published datasets available
over the internet 0, and was done primarily with the families of the superorder 
of Lamiiflorae dataset, consisting of 23 species and 29 attributes. This dataset 
was chosen as being large enough to be interesting, but small enough to be 
manageable for this initial study. A second dataset, the major clades of the 
angiosperms, consisting of 49 species and 61 attributes, was used for further 
experimentation. 

These datasets were selected because the attributes are binary, which sim- 
plified the tree-building process. As a preliminary step in evaluating the GA as 
a search mechanism for cladistics, “unknown” values for the attributes were re-
placed with 1’s to make the data fully binary. This minor alteration to the data
does impact the meaningfulness of the resulting cladograms as evolutionary hy- 
potheses, but does not affect the comparison of Gaphyl and Phylip as search 
mechanisms. 



3 The Genetic Algorithm Approach 

There are many variations on the GA approach0, but a standard methodology 
proceeds as follows: 

1. Generate a population of random solutions to the problem. (These are not 
assumed to be particularly good solutions to the problem, but serve as a 
starting point.) 

2. The GA proceeds through a number of “generations”. In each generation: 

a) Assign a “fitness” to each solution, so that we know which solutions are 
better than others. 

b) Select a “parent” population through a biased random (with replace- 
ment) process, so that higher fitness solutions are more likely to be par- 
ents. 

c) Use operators such as crossover, which combines parts of two parent 
solutions to form new solutions, and mutation, which randomly changes 
part of a solution, to create a new population of solutions. 

The algorithm terminates after a predetermined number of generations or 
when the solutions in the population have converged within a preset criterion 
(that is, until they are so similar that little is gained from combining parents to 
form new solutions). 
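
The standard methodology above can be sketched as a generic loop. This is a minimal illustration, not the Genesis code: the function name `run_ga`, its parameters, and the bit-string demonstration problem are assumptions made for the example. Here higher fitness is better, whereas Gaphyl treats lower parsimony as better. The sketch assumes an even population size and a crossover that returns two offspring.

```python
import random


def run_ga(fitness, random_solution, crossover, mutate,
           pop_size=50, generations=100, seed=0):
    rng = random.Random(seed)
    # Step 1: a population of random solutions.
    population = [random_solution(rng) for _ in range(pop_size)]
    best = max(population, key=fitness)
    history = [fitness(best)]          # best fitness seen so far
    for _ in range(generations):
        # Step 2a: assign a fitness to each solution.
        scores = [fitness(s) for s in population]
        # Step 2b: biased random selection of parents, with replacement.
        low = min(scores)
        weights = [s - low + 1 for s in scores]
        parents = rng.choices(population, weights=weights, k=pop_size)
        # Step 2c: crossover and mutation create the new population.
        population = []
        for a, b in zip(parents[0::2], parents[1::2]):
            for child in crossover(a, b, rng):
                population.append(mutate(child, rng))
        best = max(population + [best], key=fitness)
        history.append(fitness(best))
    return best, history


# Demonstration on a toy problem: maximise the number of 1 bits.
def one_point(a, b, rng):
    cut = rng.randrange(1, len(a))
    return a[:cut] + b[cut:], b[:cut] + a[cut:]


def flip_one(bits, rng):
    if rng.random() < 0.1:
        i = rng.randrange(len(bits))
        bits = bits[:i] + (1 - bits[i],) + bits[i + 1:]
    return bits


def random_bits(rng):
    return tuple(rng.randint(0, 1) for _ in range(20))


best, history = run_ga(sum, random_bits, one_point, flip_one)
```

This sketch stops after a fixed number of generations; a convergence test on the population could be substituted for the loop bound.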

Several factors should be evaluated when considering the utility of GA’s for 
a particular problem: 

1. Is there a more straightforward means of finding a “best” solution to the 
problem? (If so, there is no point in using the GA approach.) 

^ As is the custom in the evolutionary computation community, the author distin- 
guishes different forms of evolutionary computation, and is working specifically 
within the “genetic algorithms” framework. 

2. Can potential solutions to the problem be represented using simple data
structures such as bit strings or trees? (If not, it may be difficult to work 
with the mechanics of the GA.) 

3. Can a meaningful evaluation metric be identified that will enable one to 
rate the quality of each potential solution to your problem? (Without such a 
measure, the GA is unable to determine which solutions are more promising 
to work with.) 

4. Can operators be devised to combine parts of two “parent” solutions and 
produce (viable) offspring solutions? (If the offspring do not potentially re- 
tain some of what made the parents “good”, the GA will not be markedly 
better than random trial and error.) 

In the cladistics task, there is a standard approach to forming the cladograms, 
but that process also has a stochastic element, so the standard approach is not 
guaranteed to find “the best” cladogram for a given dataset. In the cladistics 
task, solutions to the problem are naturally represented as trees. In addition, a 
standard metric for evaluating a given tree is provided with the task (parsimony). 
However, there is a challenge for implementing the cladistics task using the GA 
approach: devising operators that produce offspring from two parent solutions 
while retaining meaningful information from the parents. 



4 The GA for Cladistics 

The typical GA approach to doing “crossover” with two parent solutions with a 
tree representation is to pick a subtree (an interior or root node) in both parents 
at random and then swap the subtrees to form the offspring solution. The typical 
mutation operator would select a point in the tree and mutate it to any one of the 
possible legal values (here, any one of the species). However, these approaches 
do not work with the cladistics trees because each species must be represented 
in the tree exactly once. 

4.1 Crossover Operator 

The needs for our crossover operator bear some similarity to traveling salesperson 
problems (TSP’s), where each city is to be visited exactly once on a tour. There 
are several approaches in the literature for working on this type of problem with 
a GA; however, the TSP naturally calls for a string representation, not a tree.
In designing our own operator, we studied TSP approaches for inspiration, but 
ultimately devised our own. We wanted our operator to attempt to preserve 
some of the species relationships from the parents. In other words, a given tree 
contains species in a particular relationship to each other, and we would like to 
retain a large degree of this structure via the crossover process. 

Our crossover operator proceeds as follows: 

1. Choose a species at random from one of the parent trees. Select a subtree
at random that includes this node, excluding the subtree that is only the
leaf node and the subtree that is the entire tree. (The exclusions prevent
meaningless crossovers, where no information is gained from the operation.)

2. In the second parent tree, find the smallest subtree containing all the species
from the first parent’s subtree.

3. To form an offspring tree, replace the subtree from the first parent with the
subtree from the second parent. The offspring must then be pruned (from
the “older” branches) to remove any duplicate species.

4. Repeat the process using the other parent as the starting point, so that this
process results in two offspring trees from two parent trees.

Fig. 1. Two example parent trees for a cladistics problem with seven species. A subtree
for crossover has been identified for each tree.

Fig. 2. At the left, the offspring initially formed by replacing the subtree from parent 1
with the subtree from parent 2; on the right, the offspring tree has been pruned to
remove the duplicate species F.

This process results in offspring trees that retain some of the species rela- 
tionships from the two parents, and combine them in new ways. An example 
crossover is illustrated in Figs. 1 and 2. The parents are shown in Fig. 1; Fig. 2
shows first the offspring formed via the crossover operation, with the subtree
that must now be pruned identified, and second the resulting offspring (after
pruning species F). (Note that in the cladograms, swapping the left and right
children does not affect the meaning of the cladogram.)
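
The crossover steps can be sketched on trees encoded as nested pairs with species names at the leaves. This is an illustrative reading of the description, not Gaphyl's source: the function names, the tuple encoding, and the choice to return the first parent unchanged when no eligible subtree exists are all assumptions. The example parents are invented and are not those of Fig. 1.

```python
import random


def leaves(tree):
    """Set of species below `tree` (a name, or a pair of subtrees)."""
    return {tree} if isinstance(tree, str) else leaves(tree[0]) | leaves(tree[1])


def path_subtrees(tree, species):
    """Subtrees containing `species`, from the whole tree down to the leaf."""
    found = [tree]
    while not isinstance(tree, str):
        tree = tree[0] if species in leaves(tree[0]) else tree[1]
        found.append(tree)
    return found


def smallest_subtree(tree, wanted):
    """Smallest subtree of `tree` containing every species in `wanted`."""
    while not isinstance(tree, str):
        inside = [child for child in tree if wanted <= leaves(child)]
        if not inside:
            return tree
        tree = inside[0]
    return tree


def graft(tree, old, new, duplicates):
    """Replace `old` by `new`; prune `duplicates` from the older branches."""
    if tree == old:
        return new
    if isinstance(tree, str):
        return None if tree in duplicates else tree
    kept = [k for k in (graft(c, old, new, duplicates) for c in tree)
            if k is not None]
    if not kept:
        return None
    return kept[0] if len(kept) == 1 else (kept[0], kept[1])


def crossover(parent1, parent2, rng):
    species = rng.choice(sorted(leaves(parent1)))
    # Step 1: exclude the entire tree (first) and the bare leaf (last).
    candidates = path_subtrees(parent1, species)[1:-1]
    if not candidates:
        return parent1
    old = rng.choice(candidates)
    # Step 2: smallest subtree of the second parent covering those species.
    new = smallest_subtree(parent2, leaves(old))
    # Step 3: graft, then prune species that would now appear twice.
    return graft(parent1, old, new, leaves(new) - leaves(old))


p1 = ((("A", "B"), "C"), (("D", "E"), ("F", "G")))
p2 = ((("A", "C"), "B"), (("D", "F"), ("E", "G")))
rng = random.Random(0)
# Step 4: repeat with the roles of the parents exchanged.
offspring = (crossover(p1, p2, rng), crossover(p2, p1, rng))
```

Pruning only outside the grafted subtree is what keeps the inherited species relationships intact, and it ensures each species still appears exactly once.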

4.2 Mutation Operator 

The typical GA “mutation” operator takes a location in the solution at ran- 
dom and mutates it to some other value. Again, the standard operator was not 
suited to our representation, where each species must appear exactly once in the 
tree. Instead, for our mutation operator, we selected two leaf nodes (species) at 
random, and swapped their positions in the tree. 
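
A sketch of this leaf-swap mutation follows, assuming the same nested-pair encoding of trees used for illustration elsewhere. The function names are hypothetical.

```python
import random


def leaf_names(tree):
    """List of species in `tree`, left to right."""
    if isinstance(tree, str):
        return [tree]
    return leaf_names(tree[0]) + leaf_names(tree[1])


def swap_species(tree, a, b):
    """Return a copy of `tree` with species `a` and `b` exchanged."""
    if isinstance(tree, str):
        if tree == a:
            return b
        if tree == b:
            return a
        return tree
    return (swap_species(tree[0], a, b), swap_species(tree[1], a, b))


def mutate(tree, rng):
    # Pick two distinct species at random and swap their positions.
    a, b = rng.sample(leaf_names(tree), 2)
    return swap_species(tree, a, b)


print(swap_species((("A", "B"), "C"), "A", "C"))   # (('C', 'B'), 'A')
```

Because the operator only exchanges leaf labels, the shape of the tree is unchanged and each species still appears exactly once.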

4.3 Canonical Form 

The Wagner parsimony metric uses “unrooted” trees, leading to many different 
possible representations of “the same” cladogram that are anchored at different 
points. Furthermore, flipping a tree (or subtree) left to right (switching the left 
and right subtrees) does not alter the parsimony of a cladogram (nor represent 
an alternative evolutionary hypothesis). Therefore, it soon became clear that 
Gaphyl would benefit from a canonical form that could be applied to trees
to ascertain whether trees in the population represented the same or distinct 
cladograms. 

The canonical form we instituted first picks the first species in the data set to be
an offspring of the root, and “rotates” the tree (and flips, if necessary) to reroot
the tree at that species while keeping the species relationships intact. (To
simplify comparisons, we followed the default Phylip assumption of making the
first species in the dataset the direct offspring of the root of the tree.) Second,
the subtrees are (recursively) rearranged so that left subtrees are smaller (fewer
nodes) than right subtrees and so that, when left and right subtrees have the same
number of nodes, a preorder traversal of the left subtree is alphabetically before
a preorder traversal of the right subtree. This process is carried out when saving
the “best” trees found in each generation, to ensure that no equivalent trees are
saved among the best ones. Canonical form is illustrated in Fig. 3.
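
The two steps of the canonical form can be sketched as follows. The encoding (nested pairs with species names at the leaves) and all function names are assumptions made for illustration. The sketch compares subtrees by leaf count, which orders binary trees the same way as node count, and breaks ties by comparing the left-to-right lists of species names, which is one possible reading of "alphabetically before".

```python
def leaves(tree):
    """Species of `tree` in preorder (left to right)."""
    return [tree] if isinstance(tree, str) else leaves(tree[0]) + leaves(tree[1])


def reroot(tree, species, above=None):
    """Rotate `tree` so that `species` becomes a direct child of the root.

    `above` holds everything already passed on the way down; the old root
    is dropped, so the unrooted relationships are preserved.
    Assumes `species` occurs in `tree`.
    """
    if tree == species:
        return tree if above is None else (tree, above)
    left, right = tree
    down, other = (left, right) if species in leaves(left) else (right, left)
    return reroot(down, species, other if above is None else (other, above))


def order(tree):
    """Recursively put the smaller (then alphabetically earlier) subtree left."""
    if isinstance(tree, str):
        return tree
    left, right = order(tree[0]), order(tree[1])

    def key(subtree):
        names = leaves(subtree)
        return (len(names), names)

    return (left, right) if key(left) <= key(right) else (right, left)


def canonical(tree, first_species):
    return order(reroot(tree, first_species))


# Two representations of the same unrooted cladogram.
t1 = (("A", "B"), ("C", "D"))
t2 = ("C", ("D", ("B", "A")))
print(canonical(t1, "A"))   # ('A', ('B', ('C', 'D')))
print(canonical(t2, "A"))   # ('A', ('B', ('C', 'D')))
```

Two trees can then be compared for equivalence by comparing their canonical forms directly.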

4.4 A Second Mutation Operator 

The addition of a canonical form suggested the design of a second mutation
operator. The relationships between species in a subtree are potentially useful
information for offspring to inherit from parents, but perhaps the subtrees should
be connected differently. The second mutation operator picks a random subtree
and a random species within the subtree. The subtree is rotated to have the
species as the left child of the root and reconnected to the parent.

4.5 Immigration 

Early runs with Gaphyl on the larger dataset yielded trees with a parsimony
of 280, but not 279 (lower parsimony is better). Reflection on the process and
inspection of the population determined that the process seemed to be converging
too rapidly, losing the diversity across individuals that enables the crossover
operator to find stronger solutions. “Premature convergence” is a known problem
in the GA community, and there are a number of good approaches for combatting
it. In Gaphyl, we opted to implement parallel populations with immigration.
Adding immigration to the system allowed Gaphyl to find the trees of fitness
279.

Fig. 3. An illustration of putting a tree into canonical form. The tree starts as in the
top left; an alternate representation of the tree as a “network” is shown at the bottom
left. First, the tree is rotated, so that the first species is an offspring of the root. Second,
subtrees are rearranged so that smaller trees are on the left and alphabetically lower
species are on the left.

The immigration approach implemented here is fairly standard. The popu-
lation is subdivided into a specified number of subpopulations which, in most
generations, are distinct from each other (crossovers happen only within a given
subpopulation). After a number of generations have passed, each population
migrates a number of its individuals into other populations; each emigrant
determines at random which population it will move to and which tree within
that population it will uproot. The uprooted tree replaces the emigrant in the
emigrant’s original population. The number of populations, the number of gen-
erations to pass between migrations, and the number of individuals from each
population to migrate at each migration event are, of course, all determined by
parameters to the system.
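
One migration event, as described above, amounts to a series of swaps between subpopulations. The sketch below is an illustration under stated assumptions: the name `migrate`, the list-of-lists representation, and the uniform random choice of emigrants are not taken from the Gaphyl source.

```python
import random


def migrate(populations, emigrants_per_population, rng):
    """Perform one migration event in place.

    populations: list of subpopulations, each a list of individuals.
    """
    for source, population in enumerate(populations):
        slots = rng.sample(range(len(population)), emigrants_per_population)
        for slot in slots:
            # The emigrant picks a different population and a tree to uproot.
            target = rng.choice(
                [p for p in range(len(populations)) if p != source])
            uprooted = rng.randrange(len(populations[target]))
            # The uprooted tree takes the emigrant's place at home.
            populations[source][slot], populations[target][uprooted] = (
                populations[target][uprooted], populations[source][slot])


populations = [[f"pop{i}-tree{j}" for j in range(10)] for i in range(3)]
migrate(populations, 2, random.Random(1))
```

Because every move is a swap, subpopulation sizes stay constant and no tree is lost or duplicated.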

5 Experimental Results 



Recall that both Gaphyl and Phylip have a stochastic component, which means
that evaluating each system requires doing a number of runs. In Phylip, each
distinct run first “jumbles” the species list into a different random order. In
Gaphyl, there are many different effects of random number generation: the con-
struction of the initial population, parent selection, and the selection of crossover
and mutation points. For both systems, a number of different runs must be done
to evaluate the approach.

5.1 Comparison of Gaphyl and Phylip 

1. With the Lamiiflorae data set, the performance of Gaphyl and Phylip is 
comparable. Phylip is more expedient in finding a single tree with the best 
parsimony (72), but both Gaphyl and Phylip find 45 most parsimonious 
cladograms in about twenty minutes of run time. 

2. With the angiosperm dataset, a similar pattern emerges: Phylip is able to 
find one tree with the best fitness (279) quite quickly, while Gaphyl needs 
more run time to first discover a tree of fitness 279. However, in a comparable 
amount of runtime, Gaphyl is able to find 250 different most parsimonious 
trees of length 279 (approximately 24 hours of runtime). Phylip runs for 
comparable periods of time have not found more than 75 distinct trees with 
a parsimony of 279. 

In other words, Gaphyl is more successful than Phylip in finding more trees 
(more equally plausible evolutionary hypotheses) in the same time period. 

The first task is considerably easier to solve, and Gaphyl does not require 
immigration to do so. Example parameter settings are a population size of 500, 
500 generations, 50% elitism (the 250 best trees are preserved into the next 
generation), 100% crossover, 10% first mutation, and 100% second mutation. 
Empirically, it appears that 72 is the best possible parsimony for this dataset, 
and that there are not more than 45 different trees of length 72. 

The second task, as stated above, seems to require immigration in order for 
Gaphyl to find the best known trees (fitness 279). Successful parameter settings 
are 5 populations, population size of 500 (in each subpopulation), 2000 genera- 
tions, immigration of 5% (25 trees) after every 500 generations, 50% elitism (the 
250 best trees are preserved into the next generation), 100% crossover, 10% first 
mutation, and 100% second mutation. (Immigration does not happen following 
the final generation.) 
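
The successful settings for the second task can be collected in a small configuration object, which makes the derived counts explicit. The class and field names below are hypothetical; only the numeric values come from the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GaphylSettings:            # hypothetical name, for illustration only
    populations: int = 5
    population_size: int = 500   # per subpopulation
    generations: int = 2000
    migration_interval: int = 500
    migration_rate: float = 0.05
    elitism: float = 0.50
    crossover_rate: float = 1.00
    first_mutation_rate: float = 0.10
    second_mutation_rate: float = 1.00

    @property
    def elite_count(self):
        """Trees preserved unchanged into the next generation."""
        return round(self.elitism * self.population_size)

    @property
    def migrants(self):
        """Trees leaving each subpopulation at a migration event."""
        return round(self.migration_rate * self.population_size)

    def migration_generations(self):
        """Generations followed by migration (never the final one)."""
        return list(range(self.migration_interval, self.generations,
                          self.migration_interval))


settings = GaphylSettings()
print(settings.elite_count, settings.migrants)   # 250 25
print(settings.migration_generations())          # [500, 1000, 1500]
```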

We have not yet done enough runs with either Phylip or Gaphyl to estimate
the maximum number of trees at this fitness, to estimate more precisely how
long Phylip would have to run to find 250 distinct trees, or to determine whether
279 is even the best possible parsimony for this dataset.^

Based on these initial experiments, the pattern that is emerging is that as
the problems get more complex, Gaphyl is able to find a more complete set of
trees with less work than Phylip. The work done to date
illustrates that Gaphyl is a promising approach for cladistics work, as Gaphyl
finds a wider variety of trees on this problem than Phylip does. This further
suggests that Gaphyl may be able to find solutions better than Phylip is able
to find on datasets with a larger number of species and attributes, because it
appears to be searching more successful regions of the search space.

^ We note that we inadvertently capped the number of trees that Gaphyl is able to
find in setting our elitism rate. With a population size of 500 and 50% elitism, 250 is
the maximum number of distinct trees that will be saved from one generation into
the next.



5.2 Evaluation of Contribution of Operators 



To evaluate the contributions of the GA operators to the search, additional runs 
were done with the first data set (and no immigration). Empirically, crossover 
and the second mutation operator had been found to be the largest contribu- 
tors to successful search, so attention was focused on the contributions of these 
operators. 

In the first set of experiments, the first mutation rate was set to be 0%. First,
the crossover rate was varied from 0% to 100% at increments of 10% while the
second mutation rate was held constant at 100%. Second, the second mutation
rate was varied from 0% to 100% at increments of 10% while the crossover rate
was held constant at 100%. Twenty experiments were run at each parameter setting,
each for 500 generations.

Figure 4 illustrates the effects of varying the crossover rate (solid line) and
second mutation rate (dashed line) on the average number of generations taken 
to find at least one tree of the known best fitness (72). Experiments that did 
not discover a tree of fitness 72 are averaged in as taking 500 generations. For 
example, 0% crossover was unable to find any trees of the best fitness in all 20 
experiments, and so its average is 500 generations. 
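
The averaging rule described here, with unsuccessful runs counted as the full 500 generations, can be written out as a small sketch. The function name and the use of `None` to mark a run that never reached fitness 72 are assumptions for illustration; the sample values are invented.

```python
def mean_generations(first_hits, cap=500):
    """Average generation at which the best fitness was first found.

    first_hits: one entry per run; None marks a run that never found it,
    which is averaged in as `cap` generations.
    """
    return sum(cap if g is None else g for g in first_hits) / len(first_hits)


print(mean_generations([100, None, 300, None]))   # 350.0
print(mean_generations([None] * 20))              # 500.0
```

Under this rule, a setting that never succeeds scores exactly 500, as in the 0% crossover case.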

This first experiment illustrates that in general, higher crossover rates are 
better. There is not a clear preference, however, for high rates of the second 
form of mutation. To look at this operator more closely, the final populations of 
the 20 experiments were looked at to determine how many of the best trees were 
found in each run. 

Figure 5 illustrates the effects of varying the crossover rate (solid line) and
second mutation rate (dashed line) on the average number of best trees found. 
Experiments that did not discover a tree of fitness 72 are averaged in as finding 0 
trees. For example, 0% crossover was unable to find any trees of the best fitness 
in all 20 experiments, and so its average is 0 of the best trees. 

As Fig. 5 illustrates, runs with a higher second mutation rate tend to find
more of the best trees than runs with a lower second mutation rate. 

The impact of the first mutation operator had seemed to be low based on
empirical evidence, so another set of experiments was done to assess the con-
tribution of this operator. Figure 6 illustrates two sets of experiments. In both,
the crossover rate was set at 100%; in one, the second mutation rate was set
at 0% and in the other, the second mutation rate was set at 100%. The figure
illustrates the effect of changing the first mutation rate on the average number
of generations to find at least one of the best trees.

The results of this experiment clearly indicate that higher rates of this form
of mutation are not beneficial. Furthermore, this operator is not clearly con-
tributing to the search.

Fig. 4. The effect of varying crossover rate while holding second mutation constant
and of varying the second mutation rate while holding the crossover rate constant. The
average generation at which the best fitness (72) was found is illustrated. (Plot title:
“Varying crossover and second mutation”; horizontal axis: crossover or mutation rate.)

Fig. 5. The effects of varying crossover rate while holding second mutation constant
and of varying the second mutation rate while holding the crossover rate constant. The
average number of best trees (45 max) found by each parameter setting is illustrated.
(Plot title: “Varying crossover and second mutation”; horizontal axis: crossover or
mutation rate.)

Fig. 6. The effect of varying the first mutation rate while holding crossover and second
mutation constant. The crossover rate is 100% for both graphs; second mutation rates
of 100% and 0% are shown. The average generation at which the best fitness (72) was
found is illustrated.



6 Conclusions and Future Work

The GA search process as implemented in Gaphyl represents an improvement
over Phylip’s search process in its ability to find more trees than Phylip in the
same runtime. One possible facet of this success is that the Gaphyl search process
is independent of the number of attributes (and attribute-values); the complexity
of the search varies with the number of species (which determines the number of
leaf nodes in the tree). Phylip uses attribute information in its search process.

The first mutation operator is perhaps the “obvious” form of mutation to 
implement for this problem, and yet, its use (at high levels) appears to detract 
from the success of the search. This points to the importance of evaluating the 
contributions of operators to the search process. 

There is obviously a wealth of possible extensions to the work reported here. 
First, more extensive evaluations of the capabilities of the two systems must 
be done on the angiosperms data set, including an estimate of the maximum 
number of trees of fitness 279 (and, indeed, whether 279 is the most parsimonious 
tree possible). This would entail more extensive runs with both approaches. 
Furthermore, as evidenced by the unexpected result with the mutation operator, 
the effect of the immigration operator in Gaphyl must be explored further. 

Second, more work must be done with a wider range of datasets to evaluate 
whether Gaphyl is consistently able to find a broader variety of trees than Phylip, 
and perhaps able to find trees better than Phylip is able to find. 

Third, Gaphyl should be extended to work with non-binary attributes and to
handle data with unknown values. Since the ability to work with missing values
and a number of alternative metrics are already part of the Phylip implemen-
tation, these changes should be straightforward in the Gaphyl system. This is
particularly important in that phylogenetic trees are increasingly used by biol-
ogists primarily with the A, C, G, T markers of genetic data. Gaphyl should also be
extended to implement and evaluate alternative evaluation metrics to Wagner
parsimony.

Finally, we need to compare the work reported here to other projects that 
use GA approaches with different forms of cladistics, including jZ] and 0. Both 
of these projects use maximum likelihood for constructing and evaluating the 
cladograms. The maximum likelihood approach (which is known as a “distance- 
based method”) is not directly comparable to the Wagner parsimony approach 
(which is known as a “maximum parsimony” approach). 



Abstract. We introduce a spectrum of algorithms for measuring the
similarity of high-dimensional vectors in Euclidean space. The algorithms 
proposed consist of a convex combination of two measures: one which 
contains summary data about the shape of a vector, and the other about 
the relative magnitudes of the coordinates. The former is based on a 
concept called bin-score permutations and a metric to quantify simi- 
larity of permutations, the latter on another novel approximation for 
inner-product computations based on power symmetric functions, which 
generalizes the Cauchy-Schwarz inequality. We present experiments on 
time-series data on labor statistics unemployment figures that show the 
effectiveness of the algorithm as a function of the parameter that com- 
bines the two parts. 



1 Introduction 

Modern databases and applications use multiple types of digital data, such 
as documents, images, audio, video, etc. Some examples of such applications 
are document databases 0, medical imaging and multimedia information 
systems PS|. The general approach is to represent the data objects as multi- 
dimensional points in Euclidean space, and to measure the similarity between 
objects by the distance between the corresponding multi-dimensional points ini 
Ej. It is assumed that the closer the points, the more similar the data objects. 
Since the dimensionality and the amount of data that need to be processed in- 
creases very rapidly, it becomes important to support efficient high-dimensional 
similarity searching in large-scale systems. This support depends on the devel- 
opment of efficient techniques to support approximate searching. To this end, 
a number of index structures for retrieval of multi-dimensional data along with 
associated algorithms for similarity search have been developed | ii im)i4| . For 
time-series data, there are a number of proposed ways to measure similarity. 
These range from the Euclidean distance to non-Euclidean metrics and the rep- 
resentation of the sequence by appropriate selection of local extremal points m 
Agrawal, Lin, Sawhney, and Shim Q considered fast similarity search in the pres-
ence of noise, scaling, and translation by making use of the $L_\infty$ norm. Bollobás,
Das, Gunopulos, and Mannila [2] considered similarity definitions based on the
concept of well-separated geometric sets. It has been noted in the literature, how-
ever, that as dimensionality increases, query performance degrades significantly,
an anomaly known as the dimensionality curse iHiini . Common approaches for
overcoming the dimensionality curse by dimension reduction are linear-algebraic
methods such as the Singular Value Decomposition (SVD), or applications of
mathematical transforms such as the Discrete Fourier Transform (DFT), Dis-
crete Cosine Transform (DCT), or Discrete Wavelet Transform (DWT). In these
methods, lower dimensional vectors are created by taking the first few leading
coefficients of the transformed vectors [3].

* Supported in part by NSF Grant No. CCR-9821038.

This paper introduces a spectrum of similarity algorithms which consist of
a convex combination of two different measures: a shape measure on high-
dimensional vectors based on the similarity of permutations through inversion
pairs, followed by an associated dimension reduction by bin-score permutations;
and a symmetric magnitude measure based on the computation of the inner-
product and consequently the cosine of the angle between two vectors by a low
dimensional representation.



2 The Main Decomposition 

An $n$-dimensional real vector $x = (x_1, x_2, \ldots, x_n) \in \mathbb{R}^n$ can be decomposed as a
pair $(s(x), \sigma(x))$ where $s(x)$ is the sorted version of $x$ into weakly increasing co-
ordinates, and $\sigma(x)$ is the permutation of the indices $\{1, 2, \ldots, n\}$ that achieves
this ordering. We impose the additional condition that the elements of the per-
mutation $\sigma(x)$ are put in increasing order on any set of indices for which the
value of the coordinate is constant. For example when $x = (3, 3, 1, 5, 2, 0, 1, 6, 1)$,
$s(x) = (0, 1, 1, 1, 2, 3, 3, 5, 6)$, and in one-line notation, $\sigma(x) = 6\,3\,7\,9\,5\,1\,2\,4\,8$.
Note that in $x$ the smallest coordinate value is $x_6 = 0$, the next smallest is
$x_3 = x_7 = x_9 = 1$, etc. Given $x, y \in \mathbb{R}^n$, we aim to approximate the Euclidean
distance $\|x - y\|$ as a convex combination

$$\lambda\, s(x, y) + (1 - \lambda)\, \pi(x, y) , \qquad (1)$$

where

• $s(x, y)$ is a measure of distance between $s(x)$ and $s(y)$ which is a symmetric
function of the coordinates separately in $x$ and $y$ (we refer to this as the
magnitude or the symmetric part),

• $\pi(x, y)$ is a measure of the distance between the permutations $\sigma(x)$ and $\sigma(y)$
(we refer to this as the shape part),

• $0 \le \lambda \le 1$ is a parameter that controls the bias of the algorithm towards
magnitude/symmetry versus shape.

In order for such a scheme to be useful, the individual functions $s(x, y)$ and
$\pi(x, y)$ must be amenable to computation using data with reduced dimensional-
ity $\ll n$. In the technique proposed here, this reduced dimension can be selected
separately and independently for the two parts. First we discuss the construction
of the parts themselves and then present the results of the experiments.
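
The decomposition of a vector into its sorted version and the permutation that achieves the ordering can be checked on the example given above. The sketch below is illustrative (the function name `decompose` is an assumption); it relies on Python's stable sort to place equal coordinates in increasing index order, which matches the tie-breaking condition stated in the text.

```python
def decompose(x):
    """Return (s, sigma): the sorted vector and the 1-based index permutation."""
    # sorted() is stable, so equal coordinates keep increasing index order.
    order = sorted(range(len(x)), key=lambda i: x[i])
    s = [x[i] for i in order]
    sigma = [i + 1 for i in order]
    return s, sigma


s, sigma = decompose([3, 3, 1, 5, 2, 0, 1, 6, 1])
print(s)       # [0, 1, 1, 1, 2, 3, 3, 5, 6]
print(sigma)   # [6, 3, 7, 9, 5, 1, 2, 4, 8]
```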

The outline of this paper is as follows. In Sect. 3 we consider the fast ap-
proximate calculation of $s(x, y)$, which is based on a novel low-dimensional rep-
resentation to compute the inner product introduced in and developed in |B| .
Section 4 describes how to measure the distance $\pi(x, y)$ on permutations with a
low-dimensional representation. This is based on a metric on permutations that
we introduce, and the approximation of the metric by bin-score permutations.
Experiments on labor statistics time-series data and conclusions are presented
in the final sections.

3 The Magnitude Part: Power Symmetric Functions 

Our representation of data in $\mathbb{R}^n$ with a reduced number of dimensions $m$ with
$m \ll n$ for the computation of the magnitude part $s(x, y)$ in (1) is based on
a novel approximation for the inner product introduced in 0 and further de-
veloped in 0. For integers $n, p > 0$ and $z \in \mathbb{R}^n$, the $p$-th power symmetric
function is defined by $\psi_p(z) = z_1^p + z_2^p + \cdots + z_n^p$. Note that the ordinary Eu-
clidean distance between $x$ and $y$ and the power symmetric functions are related
by

$$\|x - y\|^2 = \psi_2(x) + \psi_2(y) - 2 \langle x, y \rangle \qquad (2)$$

where $\langle x, y \rangle = x_1 y_1 + x_2 y_2 + \cdots + x_n y_n$ is the standard inner-product. Using
the $\psi_p(z)$ precomputed for each vector $z$ in the dataset, we look for an estimate
for $\langle x, y \rangle$ by approximating its $m$-th power in the form

$$\langle x, y \rangle^m \approx b_1 \psi_1(x) \psi_1(y) + b_2 \psi_2(x) \psi_2(y) + \cdots + b_m \psi_m(x) \psi_m(y) \qquad (3)$$

for large $n$, where the $b_i$ are universal constants chosen independently of $x$ and
$y$. For each high-dimensional vector $x$, we calculate $\psi_1(x), \psi_2(x), \ldots, \psi_m(x)$ and
keep these $m$ real numbers as a representative of the original vector $x$. For a given
query vector $y$, we compute $\psi_1(y), \psi_2(y), \ldots, \psi_m(y)$ and approximate $\langle x, y \rangle$
via (3), and the Euclidean distance via (2).

Our assumption on the structure of the dataset for the computation of $s(x, y)$
by this method is as follows: it consists of $n$-dimensional vectors whose compo-
nents are independently drawn from a common (but possibly unknown) distri-
bution with density H2j. In 0 the best set of constants $b_1, b_2, \ldots, b_m$ for the
approximation (3) in the sense of least-squares was computed. In particular for
the uniform distribution and $m = 2$ the optimal values are shown to be

(4)

This means that for $m = 2$, $\langle x, y \rangle$ is approximated by the expression






45 



tp2{x)'ip2{y) 



In fact in the general case of a density with $i$-th moment $\mu_i$ (about the origin), it
can be proved [Z] that the constants $b_1, b_2$ are functions of the first four moments
of the density $f(x)$. They are given by the formulas



bi = y\- 



62 = 




2y\ + - 3^i^2/i3 

M 2 + MiM 4 - 2^ii/X2AX3 ’ 
MiM3 - M 2 

M 2 + M?M4 - 2/Xi^2M3 



( 5 ) 



The moments of the uniform distribution are $\mu_i = 1/(i + 1)$, for which the
formulas in (5) reduce to the values in (4) above.
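
The role of the power symmetric functions can be illustrated with a short sketch. It assumes the power-sum definition $\psi_p(z) = z_1^p + \cdots + z_n^p$ and reads relation (2) as a statement about the squared Euclidean distance. The function names are illustrative, and the constants $b_i$ are left as an input because their values are not reproduced here.

```python
def psi(p, z):
    """p-th power symmetric function: the sum of p-th powers of the coordinates."""
    return sum(v ** p for v in z)


def inner(x, y):
    """Standard inner product."""
    return sum(a * b for a, b in zip(x, y))


def squared_distance(x, y):
    """Squared Euclidean distance from psi_2 and the inner product, as in (2)."""
    return psi(2, x) + psi(2, y) - 2 * inner(x, y)


def signature(z, m):
    """The m numbers psi_1(z), ..., psi_m(z) kept in place of the vector z."""
    return [psi(p, z) for p in range(1, m + 1)]


def approx_inner_power(sig_x, sig_y, b):
    """Right-hand side of (3) for caller-supplied constants b_1, ..., b_m."""
    return sum(bi * px * py for bi, px, py in zip(b, sig_x, sig_y))


x, y = [1, 2, 3], [4, 0, -1]
print(squared_distance(x, y))                    # 29
print(sum((a - b) ** 2 for a, b in zip(x, y)))   # 29, computed directly
print(signature(x, 3))                           # [6, 14, 36]
```

Only the signatures need to be stored per vector; the inner product on the right-hand side of (2) is what approximation (3) replaces.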

A secondary problem of interest in the context of the determination of the
best set of constants is dynamic in nature. When the contents of the database
change, by adding new data vectors for example, the parameters used for the
approximation to the inner-product calculation can be adjusted efficiently.
In particular, one need not know the density of the distribution of the
coordinates in the dataset parametrically. The moments $\mu_i$ can be estimated as
the limit of the $N$-th estimate $\mu_i(N)$ as the dataset is accumulated via

$$\mu_i(N + 1) = \frac{1}{N + 1}\left(N\,\mu_i(N) + t_{N+1}^{\,i}\right) \tag{6}$$

where $t_N$ is the $N$-th sample coordinate observed.
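The incremental moment update can be sketched as a small running estimator. This is an illustrative Python sketch only: the class name `MomentEstimator` and its methods are hypothetical, and the update rule is the running average $\mu_i(N+1) = (N\mu_i(N) + t_{N+1}^i)/(N+1)$ for each moment order $i$.

```python
class MomentEstimator:
    """Running estimates mu_1(N), ..., mu_order(N) of the moments about the origin."""

    def __init__(self, order=4):
        self.count = 0            # N: number of sample coordinates seen so far
        self.mu = [0.0] * order   # mu[i - 1] holds the current estimate of mu_i

    def update(self, t):
        """Fold in one new sample coordinate t = t_{N+1}."""
        n = self.count
        for i in range(len(self.mu)):
            self.mu[i] = (n * self.mu[i] + t ** (i + 1)) / (n + 1)
        self.count = n + 1

    def add_vector(self, vector):
        """Treat every coordinate of a newly added data vector as one sample."""
        for t in vector:
            self.update(t)

est = MomentEstimator()
est.add_vector([0.0, 0.5, 1.0])
print(est.count, est.mu)
```

Only the current estimates and the sample count are stored, so adding a vector to the database costs time proportional to its dimension and never requires a pass over the earlier data.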



4 The Shape Part: Bin-Score Permutations 

For a permutation $\rho = \rho_1\rho_2\cdots\rho_n$ of the integers $\{1, 2, \ldots, n\}$ in one-line nota-
tion, an inversion is a pair $\rho_i > \rho_j$ corresponding to a pair of indices $i < j$. Let
$Inv(\rho)$ denote the total number of inversions of $\rho$. For example for $\rho = 4\;3\;5\;2\;1$
the set of inversions is {(5, 2), (5, 1), (4, 3), (4, 2), (4, 1), (3, 2), (3, 1), (2, 1)} and
thus $Inv(\rho) = 8$. For any permutation $\rho$,

$$0 \le Inv(\rho) \le \tfrac{1}{2}n(n - 1)$$

with $Inv(\rho) = 0$ iff $\rho = 1\;2\cdots n$ is the identity permutation and $Inv(\rho) =
\tfrac{1}{2}n(n - 1)$ iff $\rho = n \cdots 2\;1$ is the reverse of the identity permutation. For the
details of the underlying partially ordered set see mi. Inversions arise naturally
in the context of sorting as a measure of presortedness uni when the number
of comparisons is the basic measure. The idea of counting inversions is one of
many ways of putting a measure of similarity on permutations j2|. Given two
permutations $\rho$ and $\tau$, we count the number of inversions $\rho$ would have if we
were to use $\tau_1\tau_2\cdots\tau_n$ as the index set. In other words we compute $Inv(\rho\tau^{-1})$.
Put

$$\pi(\rho, \tau) = \frac{2}{n(n - 1)}\, Inv(\rho\tau^{-1}) \tag{7}$$

to normalize this measure to the unit interval. Some relevant properties of $\pi$ are
as follows

1. $0 \le \pi(\rho, \tau) \le 1$,

2. $\pi(\rho, \tau) = 0$ iff $\rho = \tau$,

3. $\pi(\rho, \tau) = 1$ iff $\rho_i + \tau_i = n + 1$ for $i = 1, 2, \ldots, n$,

4. $\pi(\rho, \tau) = \pi(\tau, \rho)$,

5. $\pi(\rho, \tau) \le \pi(\rho, \delta) + \pi(\delta, \tau)$ for any permutation $\delta$.

In particular $\pi$ is a metric on permutations. However, we cannot realistically
use the permutations $\rho = \sigma(x)$ and $\tau = \sigma(y)$ introduced earlier to compute
this distance, since then there is no reduction in the dimension. The question is
then whether or not approximations to the permutations $\rho$ and $\tau$ by some lower
dimensional representation can be made, that would allow us to compute this
measure without much deviation from the actual value.
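The inversion-based distance between two permutations can be computed as follows. This Python sketch is ours (the names `inversions` and `pi_distance` are hypothetical): inversions are counted by merge sort, and the first permutation is relabeled so that the entries of the second permutation serve as the index set before counting. Permutations are given as lists in one-line notation with $n \ge 2$.

```python
def inversions(p):
    """Count pairs i < j with p[i] > p[j] by merge sort, in O(n log n) time."""
    def sort_count(a):
        if len(a) <= 1:
            return a, 0
        mid = len(a) // 2
        left, c_left = sort_count(a[:mid])
        right, c_right = sort_count(a[mid:])
        merged, count, i, j = [], c_left + c_right, 0, 0
        while i < len(left) and j < len(right):
            if left[i] <= right[j]:
                merged.append(left[i])
                i += 1
            else:
                merged.append(right[j])
                j += 1
                count += len(left) - i  # right element jumps over the rest of left
        merged.extend(left[i:])
        merged.extend(right[j:])
        return merged, count
    return sort_count(list(p))[1]

def pi_distance(rho, tau):
    """Normalized inversion distance between permutations rho and tau (n >= 2)."""
    n = len(rho)
    relabeled = [0] * n
    for r, t in zip(rho, tau):
        relabeled[t - 1] = r  # entry rho_i is placed at index tau_i
    return 2 * inversions(relabeled) / (n * (n - 1))

print(inversions([4, 3, 5, 2, 1]))                     # 8, as in the example above
print(pi_distance([4, 3, 5, 2, 1], [2, 3, 1, 4, 5]))   # 1.0, since rho_i + tau_i = 6
```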

To this end, we consider bin-score permutations. For simplicity, assume $n =
2^r$ and $\rho$ is a permutation on $\{1, 2, \ldots, n\}$. For any integer $s = 0, 1, \ldots, r$, we may
divide the index set into $b = 2^s$ consecutive subsets (bins) of length $b' = n/b = 2^{r-s}$
each. The $i$-th bin is described by the $b'$ consecutive indices

$$i_1 = (i - 1)b' + 1, \quad i_2 = (i - 1)b' + 2, \quad \ldots, \quad i_{b'} = (i - 1)b' + b'.$$

The score of this bin is the sum $\rho_{i_1} + \rho_{i_2} + \cdots + \rho_{i_{b'}}$. In this way we construct a
myopic version of $\rho$ on $\{1, 2, \ldots, b\}$ obtained by placing 1 for the smallest entry
among the bin-scores computed, 2 for the next smallest, etc. In case of ties, we
make the indices increase from left to right, as in the case of the construction
of $\sigma(x)$ described earlier (in fact, this permutation is simply $\sigma(x')$ where
$x'$ is the $b$-dimensional vector of bin-scores of $x$). The bin-score permutation
corresponding to $b = n$ is $\rho$ itself, and for $b = 1$ it is the singleton 1. As an
example, for $n = 8$, the bin-score permutations of $\rho = 5\;8\;2\;6\;4\;3\;1\;7$ for $b = 4, 2$
are obtained from the scores 13, 8, 7, 8, and 21, 15 as the permutations 4 2 1 3
and 2 1, respectively. Note that any bin-score permutation can be obtained by
repeated application of the $b' = 2$ case.
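The bin-score construction can be written in a few lines. The following Python sketch uses hypothetical names (`rank_permutation`, `bin_score_permutation`) and relies on the stability of Python's `sorted` to break ties from left to right, as the text prescribes; it reproduces the worked example for $n = 8$.

```python
def rank_permutation(scores):
    """Replace each score by its rank (1 = smallest); ties are ranked left to right."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])  # stable sort
    perm = [0] * len(scores)
    for rank, index in enumerate(order, start=1):
        perm[index] = rank
    return perm

def bin_score_permutation(p, b):
    """Bin-score permutation of p for b bins; b must divide len(p)."""
    n = len(p)
    if n % b != 0:
        raise ValueError("the number of bins must divide the length")
    width = n // b  # b' in the text
    scores = [sum(p[i * width:(i + 1) * width]) for i in range(b)]
    return rank_permutation(scores)

p = [5, 8, 2, 6, 4, 3, 1, 7]
print(bin_score_permutation(p, 4))  # scores 13, 8, 7, 8 -> [4, 2, 1, 3]
print(bin_score_permutation(p, 2))  # scores 21, 15     -> [2, 1]
```

For real-valued data the same `rank_permutation` helper can be applied directly to the vector of bin scores, which is the sense in which the bin-score permutation is the permutation of a lower-dimensional vector.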

5 Experiments 

For the experiments, we have used time series data of the seasonally adjusted
local area unemployment rate figures (Local Area Unemployment Statistics)
for the 51 states supplied online by the U.S. Department of Labor's Bureau
of Labor Statistics. The monthly rates were extracted for 256 months, covering
the period from January 1979 through April 2000 for each state.¹ The dataset
we used for the experiments conducted consisted of 51 vectors in $\mathbb{R}^{256}$, one for each of the
states, alphabetically ordered as Alabama through Wyoming, and indexed as
x[1], x[2], . . . , x[51]. Thus each x[i] is a vector in n = 256 dimensional space.
For the query vector y, we used the unemployment rate figures for the same
period for the seasonally adjusted national average figures. The purpose of the
experiments can be thought of as determining which state in the union has had
an unemployment rate history that is closest to the national average for the
period of time in question, where closest can be given different meanings by
altering the bias parameter $\lambda$ of the algorithm.

5.1 Estimation of the Parameters: The Magnitude Part 

The maximum coordinate over all the vectors in the dataset was found to be
19.5 and the minimum entry 2.1. Since we have no reason to expect the data
to be uniform in this interval, we computed $b_1$ and $b_2$ using (5) after a linear
normalization to the unit interval.

Fig. 1. The estimates of $b_1$ and $b_2$ computed for 50 vectors, each of dimension 16 with
entries from the uniform distribution on [0,1]. The theoretically obtained asymptotic
values are $b_1 = -0.0625$, $b_2 = 0.703125$

To compute the number of sample vectors of dimension 256 required to obtain
a meaningful estimate using (5) and the estimates of the first four moments of the
density obtained through (6), we first experimented with the uniform distribution,
for which we know the asymptotic values $b_1 = -1/16$, $b_2 = 45/64$. Two to three
256-dimensional vectors were enough for convergence. We generated 50 vectors
of dimension 16 each. The corresponding values of $b_1$ and $b_2$ calculated from
the estimates of the moments are shown in Fig. 1. We see that the convergence
requires about 20 vectors or about 300 samples. This means that the lower
bound on the number of 256-dimensional vectors we need to obtain reasonable
estimates of $b_1$ and $b_2$ in the general case is very small.

¹ In California unemployment rates, the 1979 data supplied were 0.0. These were
all replaced with the January 1980 value of 5.9.






Computing with 51 × 256 = 13056 normalized sample coordinates in the
dataset, the approximate moments were calculated by Mathematica to be $\mu_1 =
0.236$, $\mu_2 = 0.072$, $\mu_3 = 0.026$, $\mu_4 = 0.012$. Using the formulas (5) gives

$$b_1 = 0.017, \qquad b_2 = 0.415 \tag{8}$$

With these values, we computed the summary data $\psi_1(x[1]), \ldots, \psi_1(x[51])$
and $\psi_2(x[1]), \ldots, \psi_2(x[51])$ required.

The following vector of length 51 gives the (approximate) $\psi_1$ values of the
normalized vectors computed in this fashion:

85.3, 95.5, 58.4, 74.1, 74.1, 47.9, 43.7, 48.5, 86.0, 58.3, 52.1, 43.3, 66.6, 73.2, 63.1, 43.3, 36.7,
75.2, 92.3, 58.7, 48.2, 47.1, 90.9, 41.2, 87.8, 56.9, 65.0, 22.7, 60.1, 34.2, 58.6, 78.2, 66.2, 45.0,
34.4, 72.0, 52.1, 74.6, 69.1, 60.4, 60.6, 27.2, 66.7, 62.9, 44.4, 40.5, 40.3, 75.7, 117.7, 51.6, 53.1

and the following the $\psi_2$ values computed:

34.6, 37.6, 15.3, 23.8, 23.5, 10.6, 9.0, 11.8, 31.1, 14.7, 11.5, 8.8, 19.4, 24.6, 21.0, 10.1, 5.8,
25.7, 38.0, 15.6, 10.4, 11.4, 41.6, 8.4, 34.3, 15.0, 17.6, 3.1, 16.5, 6.9, 15.3, 25.3, 18.5, 10.2,
5.5, 25.1, 12.9, 24.9, 22.1, 17.5, 17.2, 3.5, 21.3, 16.9, 9.9, 7.8, 7.5, 25.8, 62.0, 14.4, 13.2

For example for the state of Alabama, the summary magnitude information is
$\psi_1(x[1]) = 85.3$ and $\psi_2(x[1]) = 34.6$.

We also calculated that for the query vector $y$ of normalized national average
rates, the two $\psi$ values are

$\psi_1(y) = 63.9$ and $\psi_2(y) = 17.8$.

Now for every vector $x$ in the dataset, we calculate the approximation to
$\langle x, y\rangle$ as

$$\sqrt{b_1\psi_1(x)\psi_1(y) + b_2\psi_2(x)\psi_2(y)}$$

where $b_1$ and $b_2$ are as given in (8). Therefore as a measure of distance of the
symmetric part, we set

$$s(x, y) = \sqrt{\left|\,\psi_2(x) + \psi_2(y) - 0.0350342\,\psi_1(x)\psi_1(y) - 0.829380\,\psi_2(x)\psi_2(y)\,\right|}$$

by using (2).



To see how the approximations to the magnitude part and the actual Eu-
clidean distance values compare, we plotted the normalized actual values and the
normalized approximations for the 51 states in Fig. 2. Considering that we are
only using the $m = 2$ algorithm for the computation of $s(x, y)$, i.e. the dimension
is vastly reduced, the results are satisfactory.

Fig. 2. The magnitude part: normalized actual distances (left), normalized approxi-
mations (right)



5.2 Estimation of the Parameters: The Shape Part 

To get an idea of the number of bins $b$ to use for the computation of the
approximate values $\pi(x, y)$, we calculated vectors of distances $\pi(x, y)$ through the
expression (7) with bin-score permutations instead of the actual permutations.
Bin-score permutations for each vector x[1], . . . , x[51] and the query vector $y$
were computed for $b$ ranging from 4 to 256. The resulting distances are plotted in
Fig. 3. From the figure, it is clear that even $b = 8$ is a reasonable approximation
to the actual curve (i.e. $b = 256$). In the experiments we used the case of $b = 16$
bins.




Fig. 3. The shape part: plot of bin-score permutation distances for 4-256 bins 



6 Parametric Experiments and Conclusions

In Figs. 4-7, the data plotted is the seasonally adjusted monthly unemployment
rates for 256 months spanning January 1979 through April 2000. In each figure,
the plot on the left is the actual rates through this time period, and the one on
the right is the plot of the rates sorted in increasing order. Consider the combined
distance function for $\lambda$ changing from 0 to 1 in increments of 0.01, where for each $\lambda$ value,
$s(x, y)$ makes use of the approximate distance computation described for $m = 2$,
and $\pi(x, y)$ is the distance between the bin-score permutations $\sigma(x)$ and $\sigma(y)$
for $b = 16$ bins. For each value of $\lambda$ in this range we computed the state (i.e. the
vector x[i]) where the minimum approximate distance is obtained.

• For $0.0 \le \lambda < 0.5$, the minimum is obtained at x[15], which corresponds to
the state of Indiana,

• For $0.5 \le \lambda < 0.9$, the minimum is obtained at x[46], which corresponds to
the state of Vermont,

• For $0.9 \le \lambda \le 1.0$, the minimum is obtained at x[11], which corresponds to
the state of Georgia.

The observed "continuity" of these results as a function of $\lambda$ is a desirable
aspect of any such family of algorithms.

Figures 4-7 indicate the behavior of the algorithm on the dataset. For small
values of $\lambda$, the bias is towards the shapes of the curves. In these cases the
algorithm finds that the time-series data of the national rates (Fig. 7, left) most
resemble those of Indiana (Fig. 4, left) out of the 51 states in the dataset. At the
other extreme, for values of $\lambda$ close to 1, the bias is towards the magnitudes, and
the algorithm finds that the sorted data of the national rates (Fig. 7, right) most
resemble those of the state of Georgia (Fig. 6, right). The intermediate values pick
the state of Vermont (Fig. 5) as the closest to the national average rates.

In conclusion, we proposed a spectrum of dynamic dimensionality reduc- 
tion algorithms based on the approximation of the standard inner-product, and 
bin-score permutations based on an inversion measure on permutations. The 
experiments on time-series data show that with this technique, the similarity 
between two objects in high-dimensional space can be well approximated by a 
significantly lower dimensional representation. 

We remark that even though we used a convex combination of the two mea-
sures controlled by a parameter $\lambda$, it is possible to combine $s(x, y)$ and $\pi(x, y)$
for the final similarity measure in many other ways,

$$s(x, y)^{\lambda}\,\pi(x, y)^{1-\lambda},$$

for example. In any such formulation, the determination of the best value of $\lambda$
for a given application will most likely require the experimental evaluation of the
behavior of the approximate distance function by sampling from the dataset.
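The two ways of combining the magnitude and shape measures, and a sweep over the bias parameter, can be sketched as follows. This is a hedged Python illustration: the function names are hypothetical, the candidate values are made up for the example, and the convex form is assumed to put weight $\lambda$ on the magnitude part $s$ and $1 - \lambda$ on the shape part $\pi$, consistent with the behavior described for small and large $\lambda$.

```python
def convex_combination(s, pi, lam):
    """lam * s + (1 - lam) * pi: lam = 0 uses only shape, lam = 1 only magnitude."""
    return lam * s + (1.0 - lam) * pi

def geometric_combination(s, pi, lam):
    """The alternative s^lam * pi^(1 - lam) mentioned in the text."""
    return (s ** lam) * (pi ** (1.0 - lam))

def nearest(candidates, lam, combine=convex_combination):
    """Index of the (s, pi) pair with the smallest combined distance to the query."""
    return min(range(len(candidates)),
               key=lambda i: combine(candidates[i][0], candidates[i][1], lam))

# Made-up (s, pi) distances of three data vectors to a query, both scaled to [0, 1].
candidates = [(0.10, 0.90), (0.40, 0.45), (0.90, 0.10)]
winners = [nearest(candidates, step / 100) for step in range(101)]
print(winners[0], winners[50], winners[100])
```

Sweeping $\lambda$ in increments of 0.01, as in the experiments, yields a sequence of winners; runs of identical winners correspond to the intervals of $\lambda$ reported above.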
Fig. 4. x[15] = State of Indiana unemployment rate data: actual (left), sorted (right).
For parameter $\lambda$ in the range $0.0 \le \lambda < 0.5$, the algorithm picks the state of Indiana's
data as the one closest to the national average data in Fig. 7

Fig. 5. x[46] = State of Vermont unemployment rate data: actual (left), sorted (right).
For parameter $\lambda$ in the range $0.5 \le \lambda < 0.9$, the algorithm picks the state of Vermont's
data as the one closest to the national average data in Fig. 7

Fig. 6. x[11] = State of Georgia unemployment rate data: actual (left), sorted (right).
For parameter $\lambda$ in the range $0.9 \le \lambda \le 1.0$, the algorithm picks the state of Georgia's
data as the one closest to the national average data in Fig. 7

Fig. 7. Query data y = National average unemployment rates: actual (left), sorted
(right). The dataset is the seasonally adjusted monthly unemployment rates for 256
months spanning January 1979 through April 2000

Abstract. Statistical principles suggest minimization of the total within-
group distance (TWGD) as a robust criterion for clustering point data
associated with a Geographical Information System 1T71. This NP-hard
problem must essentially be solved using heuristic methods, although 
admitting a linear programming formulation. Heuristics proposed so far 
require quadratic time, which is prohibitively expensive for data mining 
applications. This paper introduces data structures for the management 
of large bi-dimensional point data sets and for fast clustering via inter- 
change heuristics. These structures avoid the need for quadratic time 
through approximations to proximity information. Our scheme is illus- 
trated with two-dimensional quadtrees, but can be extended to use other 
structures suited to three dimensional data or spatial data with time- 
stamps. As a result, we obtain a fast and robust clustering method. 



1 Introduction 

A central problem in mining point data associated with a Geographical Infor-
mation System (GIS) is automatic partition into clusters p. Regardless of the
method used, a clustering result can be interpreted as a hypothesis that models
groups in the data. Clustering is a form of induction, and uses some bias in
order to select the most appropriate model. The optimization criteria that guide
clustering algorithms have their basis in differing induction principles. When
these optimization criteria are derived from statistical principles, it is possible
to establish formal bounds on the quality of the result, as illustrated by the no-
tion of statistical significance. In this paper, we concentrate on one such criterion
studied in the statistical literature, where it was known as the grouping and
the total within-group distance, and in the spatial clustering literature as
the full-exchange m; here, we refer to this criterion as the total within-group
distance (TWGD). The criterion has several variants widely used in many fields.

Unfortunately, the TWGD criterion is expensive to compute. Attempts in
geographical analysis have involved only very small data sets (for example, Mur-
ray HZ| applied it to fewer than 152 geographical sites). Minimizing the TWGD
turns out to be an NP-hard problem. Even though it admits a linear program-
ming formulation, the number of constraints is quadratic, making even the use
of linear-program solvers infeasible in practice. This leaves the option of using
heuristic approaches such as hill-climbing iterative search methods to obtain
approximate solutions of acceptable quality. However, traditional hill-climbing
interchange heuristics require quadratic time in the number $n$ of data items,
again far too much time for data mining applications.

In this paper, we show how a variant of the well-known quadtree data struc-
ture can be used to support a hill-climbing interchange heuristic for an approx-
imation of the TWGD optimization problem, in subquadratic time. By com-
puting only an approximation to the traditional TWGD optimization function,
great savings in time can be achieved while still producing robust clusterings.
The TWGD problem is formally introduced in Sect. 2, and a brief overview of
its relationship to other clustering problems is given. After some background on
iterative hill-climber searching in Sect. 3, the quadtree-based heuristics are pre-
sented in Sect. 4. The paper concludes with remarks on experimental
results and a brief discussion of alternative methods.

2 Distance-Based Clustering 

We consider partitioning a set of $n$ geo-referenced objects $S = \{s_1, \ldots, s_n\}$ into
$k$ clusters. Each object $s_i \in \mathbb{R}^D$ is a vector of $D$ numerical attributes. Clus-
tering methods typically rely on a metric (or distance function) $d$ to evaluate the
dissimilarity between data items. In spatial settings, such as those associated
with a GIS, the distance function measures spatial association according to spa-
tial proximity (for example, the distance between the centers of mass of spatial
objects may be used as the distance between objects jOj). While the analyses
of large data sets that arise in such contexts as spatial data mining m and
exploratory spatial data analysis (ESDA) mostly use the Euclidean distance
as a starting point, many geographical situations demand alternative metrics.
Typical examples of these are the Manhattan distance, network shortest-path
distance, or obstacle-avoiding Euclidean distance. Our method is generic in that
it places no special restrictions on $d$. Typical choices for $d$ in spatial settings are
the Euclidean distance $\textsc{Euclid}(x, y) = \sqrt{(x - y)^T \cdot (x - y)}$,¹ the Manhattan
distance $\sum_{i=1}^{D} |x_i - y_i|$, or a network distance. When time is introduced, the
user may choose a combined measure, for which our methods can also be ap-
plied. For example, one possibility is that if $D = 3$, with two coordinates for
spatial reference and the third coordinate for a time value, the distance metric
could be $d(x, y) = \alpha \cdot \sqrt{(x_1 - y_1)^2 + (x_2 - y_2)^2} + \beta \cdot |x_3 - y_3|$, where $\alpha$ and $\beta$
are constants determining the relative contributions of differences in space and
in time. Other alternatives involve cyclic measures in time that arise from the
days of the week or the months of the year. In any case, although the distance
$d$ could be costly to evaluate, we assume that this cost is independent of $n$, al-
though dependent on $D$. A clustering problem is distance-based if it makes use
of a distance metric in the formulation of its optimization criterion. One such
family of criteria is the total within-group distance:

¹ For vectors $x$ and $y$, $x^T \cdot y$ denotes their inner product.

Definition 1. Let $S = \{s_1, \ldots, s_n\} \subseteq X$ be a set of $n$ objects and $d : X \times X \to \mathbb{R}_{\ge 0}$
be a metric. The order-$\alpha$ TWGD$^{\alpha}$ clustering problem for $k$ groups is:

$$\text{Minimize } \mathrm{TWGD}^{\alpha}(P) = \sum_{m=1}^{k} \;\sum_{i<j \,\wedge\, s_i, s_j \in S_m} w_i\, w_j\, d(s_i, s_j)^{\alpha},$$

where $P = S_1|\ldots|S_k$ is a partition of $S$, $\alpha$ is some constant value (typically
1 or 2) and $w_i$ is a weight for the relevance of $s_i$, but may have other specific
interpretations. If $\alpha$ is not specified, it will be assumed to be 1 by default.

Intuitively, the TWGD criteria not only minimize the dissimilarity between
items in a group, but also use all interactions between items in a group to assess
group cohesiveness (and thus, use all the available dissimilarity information).
Also, they implicitly maximize the distance between groups (and thereby min-
imize coupling), since the terms $d(s_i, s_j)$ not included in the sum are those for
which the items belong to different groups.
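The criterion can be evaluated directly from its definition in quadratic time. The Python sketch below is illustrative only: the names (`twgd`, `euclid`, `manhattan`) are ours, a partition is represented as a list of cluster labels, and the summand is assumed to be $w_i w_j d(s_i, s_j)^{\alpha}$ over pairs in the same group.

```python
import math
from itertools import combinations

def euclid(x, y):
    """Euclidean distance between two coordinate tuples."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def manhattan(x, y):
    """Manhattan distance between two coordinate tuples."""
    return sum(abs(a - b) for a, b in zip(x, y))

def twgd(points, labels, dist, weights=None, alpha=1):
    """Total within-group distance of the partition given by `labels`.

    Sums w_i * w_j * d(s_i, s_j)^alpha over all pairs i < j in the same group;
    this brute-force evaluation needs a quadratic number of distance calls.
    """
    if weights is None:
        weights = [1.0] * len(points)
    total = 0.0
    for i, j in combinations(range(len(points)), 2):
        if labels[i] == labels[j]:
            total += weights[i] * weights[j] * dist(points[i], points[j]) ** alpha
    return total

points = [(0, 0), (3, 4), (10, 10), (10, 11)]
print(twgd(points, [0, 0, 1, 1], euclid))   # cohesive groups: 5 + 1 = 6.0
print(twgd(points, [0, 1, 0, 1], euclid))   # mixed groups give a larger value
```

Any distance function with the same two-argument signature can be passed in, which mirrors the claim that the method places no special restrictions on $d$.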

In the special case where $d(x, y) = \textsc{Euclid}(x, y)$, the literature on paramet-
ric statistics has proposed many iterative heuristics for computing approximate
solutions to the TWGD$^2$ problem j7], all of which can be considered variants of
expectation maximization (EM) using means as estimators of location. One
very popular heuristic that alternates between estimating the classification and
estimating the parameters of the model is $k$-Means. $k$-Means exhibits linear
behavior and is simple to implement; however, it typically produces poor results,
requiring complex procedures for initialization. Other well-known variations
of EM are the Generalized Lloyd Algorithm (or GLA), and fuzzy-c-clustering.
With the possible exception of $k$-C-L1-Medians m, all are representative-
based clustering methods that use the Euclidean metric.

These EM variants grant special status to the use of sums of squares of the
Euclidean metric. This preference for $\alpha = 2$ in the early statistical literature
derives from the mathematical need to use differentiation and gradient descent
for numerical optimization. However, despite its popularity, the case $\alpha = 2$ has
implications for robustness and resistance that ultimately affect the quality
of the result. That is, using $\alpha = 2$, rather than (say) $\alpha = 1$, renders the algorithms
far more sensitive to noise and outliers. Effective clustering methods can thus be
devised by concentrating on the case $\alpha = 1$ (medians rather than means). Still,
it is not immediately clear that these methods can be as fast as $k$-Means.

3 Interchange Hill- Climbers 

The minimization of TWGD($P$) is typically solved approximately using inter-
change heuristics based on hill-climbers |ibli7ll8l28|. Hill-climbers search the
space of all partitions $P = S_1|\ldots|S_k$ of $S$ by treating the space as if it were a
graph: every node of the graph can be thought to correspond to a unique parti-
tion of the data. For the TWGD problem, two nodes $P$ and $P'$ are adjacent if
and only if their corresponding partitions coincide in all but one data point.

Interchange heuristics start at a randomly-chosen solution (that is, a
random node in the implicit graph), and explore by moving from the current
solution to one of its neighbors. One general interchange heuristic, originally
proposed in 1968 by Teitz and Bart |^, is a hill-climber that is regarded as the
best known benchmark P). We will refer to this heuristic as TaB.

TaB considers the data points in a fixed circular ordering $(s_1, s_2, \ldots, s_n)$.
Whenever the turn of data point $s_i$ comes up, it is considered for changing its
group to any of the $k - 1$ others. The most advantageous interchange $P_j$ of these
alternatives is determined, and if it is an improvement over the current solution
$P^t$, then $P_j$ becomes the new current solution $P^{t+1}$; otherwise, $P^{t+1} = P^t$. In
either case, the turn then passes to the next data point in the circular list, $s_{i+1}$
(or $s_1$ if $i = n$). If a full cycle through the data set yields no improvement, a
local optimum has been reached, and the search halts.

Some care must be taken when evaluating the optimization criterion using
TaB. Given a current partition $P^t$ and one of its $k - 1$ neighbors $P_j$, a naive
approach would compute TWGD($P_j$) and TWGD($P^t$) explicitly in order to
decide whether TWGD($P_j$) < TWGD($P^t$). A more efficient way computes
the discrete gradient $\nabla(P^t, P_j) = \mathrm{TWGD}(P^t) - \mathrm{TWGD}(P_j)$. Since only $s_i$
is changing cluster membership, TWGD($P^t$) and TWGD($P_j$) differ only in
$\Theta(n)$ terms, and so $O(n)$ evaluations of the distance metric are required to
compute $\nabla(P^t, P_j)$. Therefore, the number of evaluations of the distance metric
required to test all interchanges suggested by $s_i$ is in $O(kn)$. The generic TaB
heuristic thus requires $\Omega(n^2)$ time per complete scan through the list. At least
one complete scan is needed for the heuristic to terminate, although empirical
evidence suggests that the total number of scans required is constant.
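The interchange procedure and the use of the discrete gradient can be sketched compactly. The Python code below is our own minimal reading of the description, not the pseudocode of the cited technical report: the names `median_cost` and `tab` are hypothetical, and the gradient for moving point $i$ is taken to be $w_i$ times the difference between its weighted distance sum to its current group and to the candidate group.

```python
def median_cost(i, cluster, points, labels, weights, dist):
    """Weighted sum of distances from point i to the other members of `cluster`."""
    return sum(weights[u] * dist(points[u], points[i])
               for u in range(len(points))
               if u != i and labels[u] == cluster)

def tab(points, labels, k, dist, weights=None):
    """Interchange hill-climber over a fixed circular ordering of the points.

    Moving point i from its group to group c changes the criterion by
    weights[i] * (cost to current group - cost to c), so comparing the two
    costs decides whether the move is an improvement. Each turn uses O(n)
    distance evaluations in total, giving quadratic work per full scan.
    """
    labels = list(labels)
    n = len(points)
    weights = weights if weights is not None else [1.0] * n
    i, idle = 0, 0                      # idle: consecutive turns without a move
    while idle < n:                     # a full cycle without improvement: stop
        costs = [median_cost(i, c, points, labels, weights, dist) for c in range(k)]
        best = min(range(k), key=costs.__getitem__)
        if costs[best] < costs[labels[i]]:
            labels[i] = best            # accept the most advantageous interchange
            idle = 0
        else:
            idle += 1
        i = (i + 1) % n                 # pass the turn to the next point
    return labels

points = [0, 1, 2, 10, 11, 12]
start = [0, 1, 0, 1, 0, 1]              # deliberately poor initial partition
print(tab(points, start, 2, lambda a, b: abs(a - b)))  # [0, 0, 0, 1, 1, 1]
```

A production version would start from a random partition and guard against floating-point ties; both are omitted here to keep the sketch short.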

4 Quad-Trees for Clustering 

The clustering heuristic we propose, Quad, differs from the original TaB heuris-
tic in that it replaces the discrete gradient $\nabla(P^t, P_j)$ by a computationally less-
expensive approximation $\tilde{\nabla}(P^t, P_j)$. The calculation of $\tilde{\nabla}(P^t, P_j)$ is facilitated
by a hierarchical spatial data structure, whose performance depends on the di-
mensionality of the data. For simplicity, we present our method in the context of
two-dimensional spatial data, using a variant of the PR-quadtree m. However,
the method is easily extended to spatial and spatio-temporal data in three and
higher dimensions with an appropriate choice of search structure: in particular,
the strategy can immediately be applied to octrees in three-dimensional settings.
Pseudocode descriptions of both TaB and Quad appear in an extended technical
report H2|. Also, our algorithm is generic in the value of $\alpha$, since the computation
of $d(s_i, s_j)^{\alpha}$ can be considered as part of the dissimilarity computation.

The quadtree of a point set encodes a partition that covers the plane. The
coordinates of the point $p$ at the root divide the plane into 4 logical regions by
means of the vertical and horizontal lines through $p$. Then recursively, these four
regions are further decomposed until each rectangular region contains a subset
of the input of suitable size. The quadtree in Quad preserves the property that
data points are stored at leaves, but internal nodes use data elements as their
split points. It also follows the convention that the upper and right bounding
sides of each quad-region (including the corners) are open, while the remainder
of the boundary is closed. We can easily assume that all data points are distinct,
since we can use the weight $w_i$ in Definition 1 to represent the number of copies
of point $s_i$. Using $O(n \log n)$ comparisons, the original instance of TWGD can
therefore be reduced to one with no duplicate data points. Also, we can ensure
that a particular data element is never used more than once as a split point.
These adjustments to the PR-quadtree allow us to represent all data points of
the two-dimensional plane independently of the precision at which the points are
represented, and to interleave the temporal dimensions in case the metric used is
cyclic. Figure 1 shows a data set and a corresponding PR-quadtree, where splits
were chosen randomly.

Our quadtree shares the characteristics of other variants in that its expected
depth is logarithmic, and thus the routines for insertion, deletion and search re-
quire $O(\log n)$ time. We will not elaborate on these data management operations,
as their particular implementations can be easily derived from those of other
representatives of the quadtree family m. Moreover, these data-management
operations do not involve distance computations, but only require comparisons
with the $D$ coordinates of split points. We will include the cost of constructing
the data structure as part of the clustering, but emphasize that the time required
does not change with the choice of distance used. Thus, for our purposes, our
routine Compute_Quadtree works by repeated insertion of a random shuffle
of the data and requires a total of $O(Dn \log n)$ expected comparisons.

For the operation of Quad, the internal nodes of the quadtree store additional
information pertaining to the current best partition $P^t = S_1^t|\ldots|S_k^t$. For all
$1 \le j \le k$, each internal node $\nu$ stores the total weight $T_\nu[j]$ of points belonging
to cluster $j$ under the subtree rooted at $\nu$. Thus, when all weights are 1, $T_\nu[j]$
is the number of points belonging to cluster $j$ under $\nu$. For the root $\rho$ of the
quadtree, we have $T_\rho[j] = |S_j|$ and $\sum_{j=1}^{k} T_\rho[j] = n$. Also, for all $1 \le j \le k$,
each internal node $\nu$ stores the total vector sum $M_\nu[j]$ of the data points under
the subtree rooted at $\nu$ that belong to cluster $j$. Note that from $M$ and $T$ we
can compute the center of mass $C_\nu[j] = M_\nu[j]/T_\nu[j]$ of all points under $\nu$ that belong
to cluster $j$ in the current partition $P^t$. The quadtree in Fig. 1 shows the values
of $T$ for the data set shown to its left, in the case where $1 \le j \le k = 3$.

Whenever a point $s_i$ changes its current cluster assignment, some of the
values of $T_\nu[j]$ and $M_\nu[j]$ must be updated. However, all nodes that may need to
be updated lie along the unique path from the root to the leaf corresponding to
$s_i$; also, the only values of $j$ for which this need be done are those corresponding
to the clusters between which $s_i$ migrates. More precisely, if $s_i$ migrates from
cluster $j'$ to cluster $j$, then the changes required are (1) $M_\nu[j] \leftarrow M_\nu[j] + w_i s_i$,
(2) $M_\nu[j'] \leftarrow M_\nu[j'] - w_i s_i$, (3) $T_\nu[j] \leftarrow T_\nu[j] + w_i$, and (4) $T_\nu[j'] \leftarrow T_\nu[j'] - w_i$
for all $\nu$ in the path from the root to $s_i$. Thus, if an interchange occurs, the
change in representation of the current partition $P^t$ to the new partition $P^{t+1}$
is performed in $O(\log n)$ time, and does not involve distance computations.

Fig. 1. A partition of the plane according to a two-dimensional data set, and its rep-
resentation as a quadtree (leaves show cluster labels, internal nodes show T)
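The per-node bookkeeping and the four update rules can be illustrated with a bare tree skeleton. This Python sketch is an assumption-laden illustration: the class `Node` and the functions `add_point`, `migrate` and `center_of_mass` are hypothetical names, the quadrant structure and split points are omitted, and every node on the leaf-to-root path (including the leaf) carries the per-cluster totals.

```python
class Node:
    """Tree node carrying per-cluster summaries (geometry omitted in this sketch)."""

    def __init__(self, k, dim=2, parent=None):
        self.parent = parent
        self.T = [0.0] * k                          # T[j]: total weight in cluster j
        self.M = [[0.0] * dim for _ in range(k)]    # M[j]: weighted vector sum

def path_to_root(leaf):
    """Yield the nodes on the unique path from a leaf up to the root."""
    node = leaf
    while node is not None:
        yield node
        node = node.parent

def add_point(leaf, point, weight, j):
    """Register a point of cluster j stored at `leaf` in all of its ancestors."""
    for node in path_to_root(leaf):
        node.T[j] += weight
        for d, coord in enumerate(point):
            node.M[j][d] += weight * coord

def migrate(leaf, point, weight, old, new):
    """Apply update rules (1)-(4) when the point moves from cluster old to new."""
    for node in path_to_root(leaf):
        for d, coord in enumerate(point):
            node.M[new][d] += weight * coord
            node.M[old][d] -= weight * coord
        node.T[new] += weight
        node.T[old] -= weight

def center_of_mass(node, j):
    """C[j] = M[j] / T[j], or None if no point of cluster j lies under the node."""
    if node.T[j] == 0:
        return None
    return [m / node.T[j] for m in node.M[j]]

root = Node(k=2)
leaf = Node(k=2, parent=root)
add_point(leaf, (2.0, 4.0), 1.0, 0)
migrate(leaf, (2.0, 4.0), 1.0, old=0, new=1)
print(root.T, center_of_mass(root, 1))
```

The work per migration is proportional to the path length times the dimension, and no distance evaluation is involved, which matches the cost stated above.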

We now turn our attention to $\tilde{\nabla}(P^t, P_j) = \textsc{Median}(i, j') - \textsc{Median}(i, j)$ where
$\textsc{Median}(i, j) = \sum_{s_u \in S_j} w_u\, d(s_u, s_i)$. We denote this sum as Median($i, j$) since its
value is an assessment of how well $s_i$ acts as the median of the cluster $S_j$. The
value of Median($i, j'$) is known, having been computed when the most recent
change to the partition occurred. Computing Median($i, j$) exactly, on the other
hand, would require $O(|S_j|)$ time. The quadtree-based optimization function
$\tilde{\nabla}(P^t, P_j)$ approximates Median($i, j$) with the value of

$$\textsc{Appx\_Median}(i, j) = \sum_{\nu \in \mathcal{N}_i} T_\nu[j] \cdot d(C_\nu[j], s_i), \tag{1}$$

where $\mathcal{N}_i$ is a set of nodes of the quadtree such that every leaf of the quadtree
has exactly one element of $\mathcal{N}_i$ as an ancestor. A valid choice of $\mathcal{N}_i$ will be called
an overlay for $s_i$, by analogy with the overlay of two layers in GIS. For each node
$\nu \in \mathcal{N}_i$, the center of mass $C_\nu[j]$ serves as an approximation of the location of the
data points of $S_j$ whose corresponding quadtree leaves have $\nu$ as an ancestor. By
using the distance from $s_i$ to the center of mass, and counting the distance once
for each point of $S_j$ approximated (that is, $T_\nu[j]$ times), the total contribution
to Median($i, j$) can be estimated.
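The approximation by centers of mass can be sketched independently of the tree. In the following Python illustration (names and data are our assumptions), an overlay is simply a list of per-node summaries `(T, M)`, where `T[j]` is the total weight and `M[j]` the vector sum for cluster `j` under that node.

```python
import math

def euclid(x, y):
    """Euclidean distance between two coordinate sequences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def appx_median(point, j, overlay, dist=euclid):
    """Sum over overlay nodes of T[j] * d(C[j], point), with C[j] = M[j] / T[j]."""
    total = 0.0
    for T, M in overlay:
        if T[j] > 0:                       # skip nodes with no point of cluster j
            center = [m / T[j] for m in M[j]]
            total += T[j] * dist(center, point)
    return total

# Two overlay nodes, k = 2 clusters. Under the first node, cluster 0 holds two
# points whose vector sum is (2, 0); under the second, one point at (0, 3).
overlay = [
    ([2.0, 0.0], [[2.0, 0.0], [0.0, 0.0]]),
    ([1.0, 3.0], [[0.0, 3.0], [9.0, 9.0]]),
]
print(appx_median((0.0, 0.0), 0, overlay))  # 2 * 1.0 + 1 * 3.0 = 5.0
```

The estimate is exact when the points summarized by a node are collinear with the query point on the same side; in general it underestimates the spread inside a node, which is why finer nodes are wanted close to the query point.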

Although there are many possible ways in which $\mathcal{N}_i$ may be chosen, we pro-
pose one which limits the error of approximation while still allowing for an eval-
uation using $O(\log n)$ distance computations. Consider the unique path $\pi_Q(s_i)$
in the quadtree $Q$ from the root to the leaf corresponding to $s_i$. This path has
logarithmic expected length. The boxes in this path will be considered the first
input layer to the overlay. Figure 2 illustrates this. The top left region is the first
layer corresponding directly with our quadtree $Q$. The figure in the top-middle
corresponds to a virtual set of nodes that constitute a second layer, which we call
the virtual neighborhood. This second layer $\mathcal{C}_i$ is built from the virtual complete
tree $V$ as follows. We initialize $\mathcal{C}_i$ to hold the leaf $\nu_i$ corresponding to $s_i$. Then
for every node $\nu \in \pi_V(s_i)$ (this is the path from the root in $V$), ordered from
the leaf up to the root, we add to $\mathcal{C}_i$ the siblings of $\nu$ in the quadtree $V$, the
nodes representing the neighboring regions of $\nu$ at this level, if they exist, and
the siblings of these neighboring nodes, provided that none of their descen-
dants have already been added. A node $\nu'$ is a neighbor of $\nu$ if its region shares a
common bounding edge (face boundary in 3D) with that of $\nu$. $\mathcal{N}_i$ is the overlay
of $\mathcal{C}_i$ and $\pi_Q(s_i)$ and collects nodes that are deeper in $Q$ but close to $s_i$.

Indexing techniques allow the neighbors of $\nu$ to be determined in constant time. These techniques are well known, having been used for computing linear orders (Morton orders) of the nodes. The virtual neighborhood can be computed in $O(\log n)$ time from Morton order indices. Thus computing our proposed overlay requires time proportional to the length of the path; that is, $O(\log n)$ distance evaluations. Note that all siblings in the tree are neighbors, but not all neighbors are siblings: in the case $D = 2$, a node can have up to 3 siblings and 8 neighbors. Figure 2 shows the construction of the overlay for point (0.58, 0.53) in the example data set of Fig. 1. The top left diagram shows the logical structure for the quadtree of Fig. 1. There are 12 leaves in the quadtree, appearing as white squares (the gray squares correspond to empty nodes). The neighborhood of (0.58, 0.53) across all levels is shown in the diagram in the top center; the diagrams at the bottom show the neighbors at depth 3, depth 2 and depth 1 respectively. The nodes appear in black, while neighbors and siblings of neighbors appear with different patterns.
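
As an illustration of the indexing idea, the following sketch computes a Morton (bit-interleaved) index for a cell of a complete quadtree level and lists the neighboring cells at that level. It is a minimal sketch under assumptions: cells are addressed by hypothetical integer grid coordinates `(x, y)` at a given `depth`, and the 8-neighborhood is used so that the count matches the "8 neighbors" stated for $D = 2$. The function names are illustrative, not the paper's.

```python
def morton_index(x, y, depth):
    """Interleave the low `depth` bits of x and y (x in even bit positions)."""
    code = 0
    for bit in range(depth):
        code |= ((x >> bit) & 1) << (2 * bit)
        code |= ((y >> bit) & 1) << (2 * bit + 1)
    return code

def neighbor_cells(x, y, depth):
    """Cells adjacent to (x, y) on the 2^depth by 2^depth grid of this level."""
    side = 1 << depth
    return [
        (x + dx, y + dy)
        for dx in (-1, 0, 1)
        for dy in (-1, 0, 1)
        if (dx, dy) != (0, 0) and 0 <= x + dx < side and 0 <= y + dy < side
    ]

# Morton keys of the neighbors of cell (1, 1) at depth 2.
neighbor_keys = [morton_index(nx, ny, 2) for nx, ny in neighbor_cells(1, 1, 2)]
```

For fixed-width coordinates, each key costs a bounded number of bit operations, so the neighbors at one level are obtained without traversing the tree.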

Since the overlay includes only a constant number of nodes at each level along the path $\pi(s_i)$, the size of the overlay is logarithmic, and therefore the overall time required to determine the overlay is also in $O(\log n)$. Consequently, evaluating the quadtree-based optimization function requires only $O(\log n)$ distance computations. In general, for octrees or their higher-dimensional equivalents, the constant of proportionality in these complexities rises exponentially with $D$. Nevertheless, the constants are sufficiently small for the method to be efficient in two and three dimensions.

The proposed overlay construction is such that the area (or volume) covered by a node is smaller the closer its region is to $s_i$. In particular, when $D = 2$, it is not possible to travel in the plane from the (larger) region of an overlay node at depth $\delta$ to the point $s_i$ without crossing the (smaller) region of an overlay node at depth $\delta + 2$, unless point $s_i$ is at a leaf with depth less than $\delta + 2$ (in which case the approximation is exact, as its subtree stores only one data point). This property contributes to the accuracy of the proposed approximation in Equation (1). The following lemma illustrates the claim for the Euclidean metric, but can easily be extended to other metrics.

Fig. 2. The logical structure of the quadtree of Fig. 1, the neighborhood and the overlay for (0.58, 0.53)

Lemma 1. Let $s$ and $s_i$ be data items in $D$-dimensional space such that the Euclidean distance $d(s_i, s)$ is approximated by $d(s_i, c_\nu[j])$, where $\nu$ is the unique overlay node to which $s$ is assigned. Then the relative error is no more than 2.

Proof. We may assume that $s_i$ is at the origin of the $D$-dimensional space. If $\nu$ is a neighbor or sibling in the virtual overlay of the box containing $s_i$, the approximation is exact. If $\nu$ is further up the overlay $\mathcal{N}_i$, then the line $s s_i$ cuts at least one virtual neighbor or sibling of $s_i$. In this case, the worst-case absolute error occurs when $s$ is at a corner of the $D$-dimensional box (quad) corresponding to node $\nu$, and the point $c_\nu[j]$ is at the diametrically opposite corner of the quad (the error is much less in real data sets since $c_\nu[j]$ is the center of mass of all points in the subtree rooted at $\nu$). Also, the absolute error $|d(s_i, c_\nu[j]) - d(s_i, s)|$ is maximum when the three points $s_i$, $s$ and $c_\nu[j]$ are aligned. Let $l$ be the side length of the $D$-dimensional box $B$ where $s_i$ is. Then, neighbors and siblings of the parent of $B$ in the overlay are $D$-dimensional boxes with length $2l$ per side. In general, neighbors and siblings of the $i$-th ancestor are $D$-dimensional boxes with length $2^i l$ per side. Thus, if $\nu$ is a neighbor or sibling of the $h$-th ancestor, the maximum absolute error is the Euclidean norm of the vector with all entries equal to $2^h l$. Thus,

$$|d(s_i, c_\nu[j]) - d(s_i, s)| \le l\, 2^h \sqrt{D}.$$

Because $s$ is in the region covered by the $D$-dimensional box corresponding to $\nu$, and there is at least one neighbor (or sibling) between $s_i$ and $s$ of side length $2^z l$, for $z = 1, \ldots, h-1$, we have that $d(s_i, s)$ is at least the Euclidean norm of the vector that has all entries equal to $\sum_{z=0}^{h-1} 2^z l$. Since the norm of this vector is $l\sqrt{D}\,(2^h - 1)$, the claim follows:

$$\frac{|d(s_i, c_\nu[j]) - d(s_i, s)|}{d(s_i, s)} \le \frac{l\, 2^h \sqrt{D}}{l\,(2^h - 1)\sqrt{D}} = \frac{2^h}{2^h - 1} \le 2.$$

Note that this result is independent of $l$ (which is determined by the depth of $s_i$) and of the dimension $D$. This property does not hold for a standard quadtree since a box may have an arbitrarily larger box as a neighbor. Naturally, the absolute approximation error increases with the distance $d(s_i, s)$; however, the likelihood of $s$ contributing to either cluster containing $s_i$ diminishes with distance.

5 Discussion

Algorithms for TWGD in the case $a = 1$ have been implemented for spatio-temporal clustering. In this context, using the Euclidean distance, the clusterings generated were more robust than when using TWGD^ or EM. This is not surprising since k-MEANS and EM work well on spherical and ellipsoid-shaped clusters, respectively. Also, experiments show that with a subquadratic algorithm for TWGD, one million data points can be clustered in the same CPU-time that previous quadratic algorithms required for only 10 to 20 thousand points. The subquadratic algorithms presented in earlier work are based on a different technique, randomization. While the complexity of those algorithms also does not depend exponentially upon the dimension $D$, with respect to the number of records $n$, their complexities are only in $O(n\sqrt{n})$. For the case when $D$ is small (two or three dimensional data with possibly an additional time dimension), the $O(n \log n)$ TWGD algorithms presented here are considerably faster. Of the algorithms proposed, Quad has been implemented. The experiments on CPU-time performance illustrate the scalability of our algorithm even though the metric used was the relatively inexpensive two-dimensional Euclidean metric (see Fig. 3). For more complex and expensive metrics, such as network distances or obstacle-avoiding shortest paths, the improvements from our methods would be even more dramatic. Also, the quality of the approximation has been confirmed experimentally. For 10 runs, the outcome of interchange decisions using the approximate gradient computation was contrasted with the decisions using the exact gradient computation. Figure 3 shows 95% confidence intervals for the results. From these results we see that only a fraction of one percent of the decisions made by Quad differ from those that would have been made by TaB.

Fig. 3. (a) Evaluating the interchange decisions made by Quad; (b) Illustration of the CPU-time requirements of TaB and Quad

The techniques for fast approximation of quadratic-time processes presented here have parallels with similar process approximation in other areas, such as the simulation of particle motion, graph drawing and facility location.

For particle motion, approximation of proximity information is used for the computation of the dynamics of the gravitational n-body problem. The goal is to avoid the direct computation of all forces between all pairs of particles, as this number is quadratic. These techniques naturally extend to spring-based graph layout algorithms, in which the nodes of the graph are treated as electrically-charged particles and edges as springs. Here, the interplay among the repulsive forces between nodes, and the repulsive and attractive forces along edges, serves to stretch the graph out while keeping the edge lengths reasonably balanced.

Approximation of proximity has also been explored to some extent in two- and three-dimensional clustering settings, where hierarchical structures storing aggregated information have been proposed. However, these attempts suffer from the imposition of a grid on the data where the number of cells grows quadratically in two dimensions, and cubically in three dimensions; also, for some methods, determining an appropriate granularity for the grid can be problematic.

Two of these methods deserve special mention. The BIRCH method saw the introduction of a hierarchical structure for the economical storage of grid information, called a clustering feature tree (CF-tree). The methods of our paper can be adapted to use the information summarized at the nodes of a CF-tree for approximation of the discrete median, also using $O(n \log n)$ comparisons in low-dimensional settings. Compared to the quadtree-based method, one would expect the constants of proportionality for the CF-tree version to be slightly smaller. However, due to the level of aggregation and sampling, one would also expect the quality of the results to be poorer than with quadtrees.

The STING method combines aspects of several approaches. In 2D, STING uses a hierarchical data structure whose root covers the region of analysis. As with a quadtree, each region has 4 children representing 4 sub-regions. However, in STING, all leaves are at equal depth in the structure, and all leaves represent areas of equal size in the data domain. For each node $\nu$, statistical information is computed: namely, the total number of points that correspond to the area covered by the node $\nu$, the center of mass $c_\nu$, the standard deviation $\sigma_\nu$, the largest values $\max_\nu$, and so on. The structure is built by propagating information at the children to the parents according to arithmetic formulae; for example, the total number of points under a parent node is obtained by summing the total number of points under each of its children. When the STING structure is used for clustering purposes, information is gathered from the root down. At each level, distribution information is used to eliminate branches from consideration. As only those leaves that are reached are relevant, the data points under these leaves can be agglomerated. It is claimed that once the search structure is in place, the time taken by STING to produce a clustering will be sublinear. However, as we indicated earlier, determining the appropriate depth of the structure (or equivalently the granularity of the grid) is a considerable challenge. STING would achieve the precision of the Quad algorithm only when the grid is sufficiently fine for every data point to lie in just one data cell; however, in this case the algorithm would have quadratic complexity.
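
The bottom-up propagation of summaries mentioned above can be sketched as follows. This is a hedged illustration, not STING's actual implementation: it assumes each child is summarized by a hypothetical `(count, center_of_mass)` pair, with `None` as the center of an empty cell, and it propagates only the point count and the center of mass.

```python
def merge_children(children):
    """Combine child summaries (count, center) into the parent's summary."""
    total = sum(count for count, _ in children)
    if total == 0:
        return 0, None  # an empty parent has no center of mass
    occupied = [(count, center) for count, center in children if count > 0]
    dim = len(occupied[0][1])
    # The parent's center of mass is the count-weighted mean of the child centers.
    center = tuple(
        sum(count * c[axis] for count, c in occupied) / total
        for axis in range(dim)
    )
    return total, center

parent = merge_children([(1, (0.0, 0.0)), (3, (4.0, 8.0)), (0, None), (0, None)])
```

The same aggregates (a count and a center of mass per node) are the ones the quadtree-based approximation relies on.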

It should be noted that all the methods mentioned above, including the approximate Quad method, suffer in that the size of the data structure would expand rapidly with the dimensionality of the data.

An alternative strategy to our approach here is the use of spatial data structures ($R^*$-trees) to either sample the data set or maintain a Voronoi structure to speed up interchange heuristics. This has been applied with particular attention to the I/O operations required, since this is the dominant term when clustering is at the interface of spatio-temporal data mining and spatial database systems. However, this is restricted to the Euclidean distance and to medoids. Medoid methods are comparable to the case $a = 1$ of the TWGD since they are analogous to the minimization of an $L_1$-loss function, and since $O(n \log n)$ algorithms have been achieved through approximation for $D = 2$.

The strategy of facility location by aggregation has generated much debate concerning the amount of introduced error. The common recommendation is to minimize its use; however, this is driven by an interest in optimizing the associated cost rather than the quality of the clustering. While facility location is generally concerned with larger problems, these are usually smaller than those arising in most data mining applications. Approximate solutions using aggregation provide a basis for new aggregation schemes that can again be solved approximately. Iterative refinements of such schemes can result in reduction of the error. Also, algorithms suited to large problems can reduce the proportion of aggregation required to obtain more accurate solutions. In this way, our approximation methods can contribute directly to improvements in existing facility location strategies.

Finally, there is the perception that partitioning algorithms require a priori knowledge of the number $k$ of clusters. However, a fast algorithm that can robustly cluster for a given $k$ is an effective tool for determining the appropriate number of clusters. An algorithm that quickly evaluates $L(k)$ can be used as the basis of an algorithm with optimization criterion $L$ that uses hill-climbing on $k$ to determine $k$. Methods such as AutoClass and Snob use hill-climbing with an initial random $k_0$ to effectively determine an appropriate choice of $k$.

Abstract. We provide sub-quadratic clustering algorithms for generic dissimilarity. Our algorithms are robust because they use medians rather than means as estimators of location, and the resulting representative of a cluster is actually a data item. We demonstrate mathematically that our algorithms converge. The methods proposed generalize approaches that allow a data item to have a degree of membership in a cluster. Because our algorithm is generic to both fuzzy membership approaches and probabilistic approaches for partial membership, we simply name it non-crisp clustering. We illustrate our algorithms by categorizing Web visitation paths. We outperform previous clustering methods since they are all of quadratic time complexity (they essentially require computing the dissimilarity between all pairs of paths).



L. De Raedt and A. Siebes (Eds.): PKDD 2001, LNAI 2168, pp. 103 ff., 2001. © Springer-Verlag Berlin Heidelberg 2001

1 Introduction

In a top-down view of clustering, the aim is to partition a large data set of heterogeneous objects into more homogeneous classes. Many clustering methods are built around this partitioning approach. Because we are to partition a set into more homogeneous clusters, we need to assess homogeneity. The degree of homogeneity in a group is a criterion for evaluating that the samples in one cluster are more like one another than like samples in other clusters. This criterion is then made explicit if there is a distance between objects. The type of clustering that we find here attempts to find the partition that optimizes a given homogeneity criterion defined in terms of distances. Thus, this type of clustering is also distance-based, but rather than looking for the most similar pair of objects and grouping them together, which is the bottom-up approach of hierarchical clustering, we seek to find a partition that separates into clusters. Variants of the partitioning problem arise as we see different criteria for defining homogeneity and also different measures of similarity. Then, within one variant of the problem, several algorithms are possible. Consider the following optimization criterion for clustering, which attempts to minimize the heterogeneity in each group with respect to the group representative:

$$\text{Minimize } M^a(C) = \sum_{i=1}^{n} d(u_i, \mathrm{REP}[u_i, C])^a, \qquad (1)$$

where $a \ge 1$ is a constant, $C \subset U$ is a set of $k$ representatives, $\mathrm{REP}[u_i, C]$ is the most similar representative (in $C$) to $u_i$ and $d(\cdot,\cdot)$ is a measure of dissimilarity. We underline that $d(\cdot,\cdot)$ does not need to be a distance satisfying the axioms of a metric. In particular, while $d(u_i, u_i) = 0$, we do not require that $d(u_i, u_j) = 0$ implies $u_i = u_j$. Also, we do not require the triangle inequality; however, we do expect symmetry, that is, $d(u_i, u_j) = d(u_j, u_i)$. These requirements are satisfied by all similarity measures based on computing the cosine of the angle between attribute-vectors of positive-valued features. Note that these similarity functions $sim(\cdot,\cdot)$ have a range in [0,1] and the corresponding dissimilarity is $d(\cdot,\cdot) = 1 - sim(\cdot,\cdot)$, also in the range [0,1].
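
A small sketch makes the point about the weak requirements on $d(\cdot,\cdot)$ concrete. It assumes attribute-vectors are plain tuples of non-negative numbers with at least one positive entry; the function name is illustrative only.

```python
import math

def cosine_dissimilarity(u, v):
    """d(u, v) = 1 - cos(angle between u and v), for non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

# Two different vectors with the same direction are at dissimilarity (about) zero,
# so d(u, v) = 0 does not imply u = v.
same_direction = cosine_dissimilarity((1.0, 2.0), (2.0, 4.0))
orthogonal = cosine_dissimilarity((1.0, 0.0), (0.0, 1.0))
```

The measure is symmetric by construction, which is the one property the text above does require.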

Equation (1) is not unfamiliar to statistics nor to the Data Mining community. The case $a = 2$ defines the objective function that the k-Means method iteratively approximates. This case is the result of applying an analysis of variance using the total sum of squared errors as the loss function. The case $a = 1$ replaces means by medians and the $L_2$ loss function by the $L_1$ loss function (the total absolute error). The case $a = 1$ was brought over from statistics as medoid-based clustering. However, CLARANS is a randomized interchange hill-climber; the randomization is used to obtain subquadratic algorithmic complexity. CLARANS cannot guarantee local optimality. The best interchange hill-climber is the Teitz and Bart heuristic, which requires quadratic time. Only in restricted cases has the time complexity of this type of hill-climber been reduced to subquadratic time (for example, $D = 2$ and Euclidean distance). Because medians are a more robust estimator of location than means, an algorithm optimizing the case $a = 1$ is more resistant to noise and outliers than an algorithm optimizing the case $a = 2$. However, k-Means is heavily used because it is fast, despite its many drawbacks documented in the literature.

Clustering algorithms optimizing the family of criteria given by Equation (1) search for a subset $C = \{c_1, \ldots, c_k\}$ of $k$ elements in $U$. In the case $a = 1$, we use medoids to denote discrete medians; that is, the estimator of location for a cluster shall be a member of the data. Prior to the approach presented here, the classification step for medoids performed a crisp classification. That is, classification computes the representative $\mathrm{REP}[u_i, C]$ of each data item $u_i$ and each data item belongs only to the cluster of its representative. It seems that robustness to initialization (as happens in k-Harmonic Means) comes from combining a relaxation of crisp classification with a boosting technique. To achieve this, each data item should be able to belong to different clusters with a degree of membership (the degree could be a probability as in Expectation Maximization, or a fuzzy membership as in fuzzy-c-Means).

Thus, our contribution here is to achieve this degree of membership to different clusters. We will propose algorithms and then show their advantages. We prove mathematically that they converge. We show our algorithms are generic (the degree of membership could be a fuzzy-membership function, a harmonic distribution, or even revert to the case of crisp-membership). We show versions requiring sub-quadratic time by using randomization. We apply our algorithms to a case study where other algorithms cannot be used.

2 Non-crisp Clustering Algorithms

Consider the crisp classification step that computes $\mathrm{REP}[u_i, C]$ for each data item $u_i$. It requires a simple pass through the data and $O(nk)$ computations of $d(\cdot,\cdot)$. This results in a temporary partition of $U$ into $k$ clusters $U_1, \ldots, U_k$. We can generalize crisp classification by assigning a vector $\pi_i \in [0,1]^k$ to each $u_i$. We consider that the $j$-th entry $\pi_{ij}$ denotes the degree by which $u_i$ belongs to the $j$-th cluster. In parallel with Expectation Maximization and fuzzy-c-Means, we will normalize the values so that $\sum_{j=1}^{k} \pi_{ij} = 1$. Crisp classification means the closest representative has its entry in $\pi_i$ set to 1 while all other clusters have their entry set to zero. Non-crisp classification will mean more than one cluster has its entry different from zero.

To detail our approach further, we need to introduce some notation. Given a vector $x \in \mathbb{R}^k$, let $\mathrm{SORT}(x) \in \mathbb{R}^k$ denote the vector of sorted entries from $x$ in non-decreasing order (entries with equal values are arranged arbitrarily). Thus, if $j < j'$ then $\mathrm{SORT}(x)_j \le \mathrm{SORT}(x)_{j'}$. We use the notation $e_j$ to denote the $j$-th canonical vector $(0, \ldots, 0, 1, 0, \ldots, 0)$ that has only the $j$-th entry equal to 1 and all others equal to zero. This notation allows us to rewrite the loss function in Equation (1) because the minimum operator in $\mathrm{REP}[u_i, C]$ can be replaced by $e_1^T \cdot \mathrm{SORT}(\cdot)$; for $a = 1$ the loss becomes $\sum_{i=1}^{n} e_1^T \cdot \mathrm{SORT}(d(u_i, c_1), d(u_i, c_2), \ldots, d(u_i, c_k))$.

Moreover, we can already describe a Harmonic set of weights as the degrees of membership. Let $K = \sum_{j=1}^{k} 1/j$; the Harmonic loss function is then $\frac{1}{K} \sum_{i=1}^{n} (1, 1/2, \ldots, 1/j, \ldots, 1/k)^T \cdot \mathrm{SORT}(d(u_i, c_1), d(u_i, c_2), \ldots, d(u_i, c_k))$. Moreover, we propose here an even more general approach, and let $\omega \in \mathbb{R}^k$ be any vector with non-negative entries and entries in descending order; then, the non-crisp classification loss function with respect to $\omega$ is

$$M^\omega(C) = \frac{1}{\|\omega\|_1} \sum_{i=1}^{n} \omega^T \cdot \mathrm{SORT}(d(u_i, c_1), d(u_i, c_2), \ldots, d(u_i, c_k)) \qquad (2)$$

(where we denote the 1-norm of a vector $x$ as $\|x\|_1$ and it equals $\sum_{j=1}^{k} |x_j|$).

Note that the first canonical vector $e_1$ and the Harmonic vector with $1/j$ in its $j$-th entry are special cases for $\omega$. The vector giving the degree of membership of $u_i$ is defined by $\pi_{ij} = d(u_i, c_j)\,\omega_j / \|\omega\|_1$.
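
The loss in Equation (2) can be evaluated directly from its definition. The sketch below is illustrative only: `data`, `reps` and the dissimilarity `d` are hypothetical inputs, and `weights` stands for a vector $\omega$ with non-negative entries in descending order.

```python
def non_crisp_loss(data, reps, weights, d):
    """(1 / ||w||_1) * sum over items of w^T SORT(distances to all representatives)."""
    norm = sum(weights)  # equals the 1-norm because the weights are non-negative
    total = 0.0
    for u in data:
        sorted_dists = sorted(d(u, c) for c in reps)  # non-decreasing order
        total += sum(w * x for w, x in zip(weights, sorted_dists))
    return total / norm

def abs_diff(a, b):
    return abs(a - b)

points = [0.0, 1.0, 10.0]
medoids = [0.0, 10.0]
crisp = non_crisp_loss(points, medoids, [1.0, 0.0], abs_diff)     # omega = e_1
harmonic = non_crisp_loss(points, medoids, [1.0, 0.5], abs_diff)  # omega = (1, 1/2)
```

With $\omega = e_1$ only the nearest representative contributes, which recovers the crisp medoid objective; with the Harmonic weights every representative contributes, with influence decreasing in rank.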

Recently, by using randomization, several variants of clustering algorithms for optimizing instances of the criteria in Equation (1) with $a = 1$ have emerged. These variants are subquadratic robust clustering algorithms. These randomized algorithms have been shown mathematically and empirically to provide robust estimators of location. This is because randomization is not sampling. Sampling reduces the CPU-requirements by using a very small part of the data, and the accuracy suffers directly with the size of the sample. Randomization uses the entire data available.

While those algorithms are more robust than k-Means to initialization, we produce here a general family of algorithms that optimize Equation (2) and are even more robust to initialization.

2.1 The Expectation Maximization Type

Our first family of randomized algorithms carries out an iterative improvement as found in k-Means, Expectation Maximization, fuzzy-c-Means and k-Harmonic Means (refer to Fig. 1). The iteration alternates classification of data from a current model (classification step) and model refinement from classified data (reconstruction step). We say that it is of the Expectation Maximization type because it follows the alternation between Expectation and Maximization. That is, Expectation because in the statistical sense we estimate the hidden information (the membership to clusters of the data items), and Maximization because we use some criteria (like Maximum Likelihood) to revise the description (model) of each cluster.

    Iterative_Step(C = {c_1, ..., c_k} ⊂ U)
      Classification_Step(C)
        For j = 1, ..., k: find U_j = {u_i ∈ U | d(u_i, c_j) ≤ d(u_i, c_j'), j' = 1, ..., k}
      new C ← Reconstruction_Step
        For j = 1, ..., k: new c_j ← new estimator of median for U_j

Fig. 1. Body of the iteration alternating between finding a classification for the data given a model, and finding a model given classified data.

Our task now is to describe the iterative algorithm that minimizes the non-crisp classification loss function with respect to a vector $\omega$. The algorithm has the generic structure of iterative algorithms. It will start with a random set $C^0$. The $t$-th iteration proceeds as follows. First we note that non-crisp classification is computationally as costly as classification in the crisp algorithms. It essentially requires computing SORT of the vector of distances of the data item $u_i$ under consideration to each member of the current set of representatives $C^t$. This requires computations of $d(\cdot,\cdot)$ for the $k$ representatives. Although sorting $k$ items requires $k \log k$ comparisons, it does not require any more dissimilarity computations and, in the applications we have in mind, the computation of dissimilarity values is far more costly than the time required to sort the $k$ values. Thus, classification is performed by a simple pass through the data and $O(nk)$ dissimilarity computations. This results in a labeling of all data items in $U$ by a vector ranking the degrees of membership to the $k$ clusters. That is, for each $u_i$, we record the rank of $d(u_i, c_j^t)$, where $c_j^t$ is the $j$-th current representative. We let $\mathrm{RANK}_i^t[j]$ denote the rank of $u_i$ with respect to the $j$-th representative at the $t$-th classification. For example, if the first current representative is the 3rd nearest representative to $u_i$ then $\mathrm{RANK}_i^t[1] = 3$.
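
The rank vector of a single data item can be computed with one sort of its $k$ dissimilarities. The following is a minimal sketch with hypothetical names; ranks are 1-based as in the example above, and ties are broken arbitrarily by the sort.

```python
def rank_vector(u, reps, d):
    """Return ranks[j] = rank of the j-th representative by dissimilarity to u (1 = nearest)."""
    # One dissimilarity computation per representative, then a sort of k values.
    dists = [d(u, c) for c in reps]
    order = sorted(range(len(reps)), key=lambda j: dists[j])
    ranks = [0] * len(reps)
    for position, j in enumerate(order, start=1):
        ranks[j] = position
    return ranks

# The first representative (7) is the 3rd nearest to the item 0, so its rank is 3.
ranks = rank_vector(0, [7, 1, 3], lambda a, b: abs(a - b))
```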

The reconstruction step computes new representatives. For each old representative, we have a degree of membership of every data item. Thus, for each $j = 1, \ldots, k$ we seek a new representative that minimizes $f_j(x) = \sum_{i=1}^{n} e^T_{\mathrm{RANK}_i^t[j]} \cdot \omega \, d(x, u_i) = \sum_{i=1}^{n} \omega_{\mathrm{RANK}_i^t[j]} \, d(x, u_i)$. Thus, for example, if an item $u_i$ was ranked first with respect to the $j$-th representative because it was its nearest representative, then the distance $d(x, u_i)$ in $f_j(x)$ is multiplied by the largest weight in $\omega$ (which is the value in the first entry). Also, note that in the case of crisp classification, all weights are zero except the largest (and the largest weight equals 1). Thus, the minimization of $f_j(x)$ above just corresponds to finding the median amongst the data items in the $j$-th cluster. Because of this, the data item $u_i$ such that $f_j(u_i)$ is smallest will be called the $j$-th discrete weighted median.

To detail our algorithm further, we must describe how a new representative is computed by minimization of $f_j(x)$ amongst the data items $u_i$. The algorithm we introduce for this task is a randomized approximation inspired by a sub-quadratic randomized algorithm for computation of the discrete median. For this subproblem of approximating the discrete median we note that $U = \{u_1, \ldots, u_n\}$ is the set of candidates (during the $t$-th iteration of the algorithm) for finding the minimum of each function $f_j(x)$, for $j = 1, \ldots, k$.

However, for simplicity of the notation, in what follows we drop the superindex $t$ since it is understood we are dealing with the current iteration. We will also assume that we know which representative is being revised and we denote by OLD_MED(j) the previous approximation to the data item that minimizes $f_j(x)$ (during the $t$-th iteration, OLD_MED(j) is actually $c_j$).

Clearly, the discrete weighted median MED(j) can be computed in $O(\|U\|^2)$ computations of the dissimilarity $d(\cdot,\cdot)$ by simply computing $f_j(x)$ for $x = u_1, \ldots, x = u_n$ and returning the $x$ that results in the smallest value. We will refer to this algorithm as Exhaustive. It must be used carefully because it has quadratic complexity in the size $\|U\| = n$ of the data. However, it has linear complexity in $\phi(d)$, the time to compute $d(\cdot,\cdot)$. Hence our use of randomization.

The first step consists of obtaining a random partition of $U$ into approximately $r = \sqrt{n}$ subsets $U_1, \ldots, U_r$, each of approximately $n/r \approx \sqrt{n}$ elements. Then, algorithm Exhaustive is applied to each of these subsets to obtain $m(j)_s = \mathrm{MED}(U_s)$, $s = 1, \ldots, r$. These $r$ items constitute candidates for the $j$-th discrete weighted median of $U$. We compute $f_j(m(j)_s)$ for $s = 1, \ldots, r$ and also $f_j(\mathrm{OLD\_MED}(j))$. The item that provides the smallest value amongst these (at most $r + 1$) items is returned as the new approximation to the discrete median. The algorithm has complexity $O(\phi(d)\|U\|\sqrt{\|U\|})$ because Exhaustive is applied to $O(\sqrt{\|U\|})$ sets, each of size $\Theta(\sqrt{\|U\|})$; thus this requires $O(\phi(d)\|U\|\sqrt{\|U\|})$ time. Finally, evaluating $f_j(m(j)_s)$ requires $O(\phi(d)\|U\|)$ time and is performed $O(\sqrt{\|U\|})$ times. This is also $O(\phi(d)\|U\|\sqrt{\|U\|})$ time. These types of randomized algorithms for finding discrete medians have been shown mathematically and empirically to provide robust estimators of location. This is because randomization is not sampling. Randomization uses the entire data available.
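
The randomized reconstruction described above can be sketched as follows. This is an illustration under assumptions rather than the authors' code: `weight(u)` stands for $\omega_{\mathrm{RANK}_i[j]}$ of item `u`, `rng` is a `random.Random` instance, and the random partition is obtained by shuffling and dealing the items into about $\sqrt{n}$ groups.

```python
import math
import random

def exhaustive(candidates, f):
    """Return the candidate with the smallest value of f."""
    return min(candidates, key=f)

def randomized_weighted_median(items, weight, d, old_med, rng):
    """Approximate the item minimizing f(x) = sum_u weight(u) * d(x, u)."""
    def f_on(subset):
        return lambda x: sum(weight(u) * d(x, u) for u in subset)

    r = max(1, math.isqrt(len(items)))           # about sqrt(n) groups
    shuffled = list(items)
    rng.shuffle(shuffled)                        # random partition of all the data
    groups = [shuffled[s::r] for s in range(r)]  # each group has about sqrt(n) items
    # Exhaustive search inside each group yields one candidate per group.
    candidates = [exhaustive(g, f_on(g)) for g in groups]
    candidates.append(old_med)                   # keep the previous median as a candidate
    # The candidates are compared using the full data set.
    return exhaustive(candidates, f_on(items))

values = list(range(101))
approx = randomized_weighted_median(
    values, lambda u: 1.0, lambda a, b: abs(a - b), old_med=0, rng=random.Random(0)
)
```

Because the previous median is always among the candidates, the returned item can never have a larger value of $f_j$ than the old representative, which is the property the convergence argument relies on.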

We enhance the fundamental results of iterative clustering algorithms by proving that our algorithm converges. We prove this by showing that both steps, the non-crisp classification step and the reconstruction step, never increase the value of the objective function in Equation (2).

Lemma 1. Let $\mathrm{RANK}_i^t$ be the rank vector for each $u_i \in U$ ($i = 1, \ldots, n$) after the non-crisp classification step in the $t$-th iteration of the algorithm. Let

$$M^\omega(C^t) = \frac{1}{\|\omega\|_1} \sum_{i=1}^{n} \sum_{j=1}^{k} \omega_{\mathrm{RANK}_i^t[j]} \, d(u_i, c_j^t). \qquad (3)$$

Then, the value of the objective function $\frac{1}{\|\omega\|_1} \sum_{i=1}^{n} \sum_{j=1}^{k} \omega_{\mathrm{RANK}_i^t[j]} \, d(u_i, c_j^{t+1})$ after the reconstruction step is no larger than $M^\omega(C^t)$.

Proof. First note that, by expanding the dot product $\omega^T \mathrm{SORT}(\cdot)$ and using the RANK vector, we have that Equation (2) and Equation (3) are the same. Then, we can reverse the order of the summation signs and also note that $1/\|\omega\|_1$ is a constant. Thus, the objective function is simply $M^\omega(C^t) = \frac{1}{\|\omega\|_1} \sum_{j=1}^{k} f_j(c_j^t)$. The reconstruction step finds new $c_j^{t+1}$ such that $f_j(c_j^{t+1}) \le f_j(c_j^t)$, for $j = 1, \ldots, k$ (because the previous discrete median is considered among the candidates for a new weighted discrete median). Thus,

$$M^\omega(C^t) \ge \frac{1}{\|\omega\|_1} \sum_{i=1}^{n} \sum_{j=1}^{k} \omega_{\mathrm{RANK}_i^t[j]} \, d(u_i, c_j^{t+1}).$$



Lemma 2. The value $M^\omega(C^{t+1} = \{c_1^{t+1}, \ldots, c_k^{t+1}\})$ after a classification step is no larger than $\frac{1}{\|\omega\|_1} \sum_{i=1}^{n} \sum_{j=1}^{k} \omega_{\mathrm{RANK}_i^t[j]} \, d(u_i, c_j^{t+1})$ resulting in the previous reconstruction step.

Proof. Note that $\frac{1}{\|\omega\|_1} \sum_{j=1}^{k} f_j(c_j^{t+1}) = \frac{1}{\|\omega\|_1} \sum_{j=1}^{k} \sum_{i=1}^{n} \omega_{\mathrm{RANK}_i^t[j]} \, d(c_j^{t+1}, u_i)$. Thus, we can say that the contribution of $u_i$ to the objective function before the next classification is $\frac{1}{\|\omega\|_1} \sum_{j=1}^{k} \omega_{\mathrm{RANK}_i^t[j]} \, d(u_i, c_j^{t+1})$. We know that after classification, this contribution by $u_i$ is

$$\frac{1}{\|\omega\|_1} \, \omega^T \mathrm{SORT}(d(u_i, c_1^{t+1}), d(u_i, c_2^{t+1}), \ldots, d(u_i, c_k^{t+1})).$$

Thus, the terms $d(u_i, c_j^{t+1})$ involved before and after non-crisp classification are the same; it is just that after classification they are in non-decreasing order. Since $\omega$ has entries in descending order we claim that

$$\omega^T \mathrm{SORT}(d(u_i, c_1^{t+1}), d(u_i, c_2^{t+1}), \ldots, d(u_i, c_k^{t+1})) \le \sum_{j=1}^{k} \omega_{\mathrm{RANK}_i^t[j]} \, d(u_i, c_j^{t+1}),$$

whatever the order of the set $\{d(u_i, c_1^{t+1}), \ldots, d(u_i, c_k^{t+1})\}$ given by the permutation encoded in $\mathrm{RANK}_i^t[j]$ (note the rank is the one that resulted in the previous classification, and we are about to perform the $(t+1)$-th classification).

We prove this claim by showing that if $j < j'$ and $d(u_i, c_j^{t+1}) > d(u_i, c_{j'}^{t+1})$, but $d(u_i, c_j^{t+1})$ is ranked earlier than $d(u_i, c_{j'}^{t+1})$ in the $t$-th ranking, then we can reduce the value of the contribution of $u_i$ by swapping the order in the permutation of $j$ and $j'$.

This is because if $j < j'$ we have $\omega_j \ge \omega_{j'}$, and $d(u_i, c_j^{t+1}) \ge d(u_i, c_{j'}^{t+1})$ implies $\omega_j (d(u_i, c_j^{t+1}) - d(u_i, c_{j'}^{t+1})) \ge \omega_{j'} (d(u_i, c_j^{t+1}) - d(u_i, c_{j'}^{t+1}))$. Thus, $\omega_j d(u_i, c_j^{t+1}) + \omega_{j'} d(u_i, c_{j'}^{t+1}) \ge \omega_{j'} d(u_i, c_j^{t+1}) + \omega_j d(u_i, c_{j'}^{t+1})$. Thus, swapping $j$ and $j'$ cannot increase the contribution of $u_i$. This proves that the sorted array of values minimizes the contribution of $u_i$, and this holds for all $u_i$. Thus, the new non-crisp classification produces a ranking that cannot increase the value of the objective function.
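
The swap argument is an instance of the rearrangement inequality, and it can be checked numerically on a small case. The sketch below is only a sanity check with made-up numbers: it enumerates every pairing of three dissimilarities with descending weights and compares the best one with the sorted pairing.

```python
from itertools import permutations

weights = [1.0, 0.5, 0.25]  # descending, as required of omega
dists = [0.9, 0.2, 0.6]     # hypothetical dissimilarities of one item to k = 3 representatives

def contribution(ordering):
    """Weighted sum when the i-th weight is paired with the i-th entry of `ordering`."""
    return sum(w * x for w, x in zip(weights, ordering))

best = min(contribution(p) for p in permutations(dists))
sorted_value = contribution(sorted(dists))  # non-decreasing order, as after classification
```

Here the sorted pairing attains the minimum over all orderings, in agreement with the claim in the proof.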



Theorem 1. Our algorithm converges.

Proof. The domain of the objective function has size $\binom{n}{k}$ since it consists of all subsets of size $k$ of $U$. Thus, the objective function has a finite range. The algorithm cannot decrease the value of the objective function indefinitely.

This result is in contrast to the problem of the continuous median in dimensions $D \ge 2$ and the Euclidean metric (the continuous Fermat-Weber problem) where fundamental results show that it is impossible to obtain an algorithm to converge (numerical algorithms usually halt because of the finite precision of digital computers). Other algorithms for non-crisp classification, like k-Harmonic Means, have not been shown to converge.



2.2 The Discrete Hill-Climber Type 



We now present an alternative algorithm for optimizing Equation (2). The algorithm here can be composed with the algorithm in the previous section (for example, the first can be the initialization of the latter). The result is an even more robust algorithm, still with complexity $O(n\sqrt{n})$ similarity computations.

The algorithm in this section is an interchange heuristic based on a hill-climbing search strategy. However, we need to adapt it to non-crisp classification, since all previous versions are for the crisp classification case. We will first present a quadratic-time version of our algorithm, which we will name non-crisp TaB in honor of Teitz and Bart's original heuristic. Later, we will use randomization to achieve subquadratic time complexity, as in the previous section.

Our algorithms explore the space of subsets $C \subset U$ of size $k$. Non-crisp TaB will start with a random set $C^0$. Then, the data items $u_i \in U$ are organized in a circular list. Whenever the turn belonging to a data item $u_i$ comes up, if $u_i \notin C^t$, it is used to test at most $k$ subsets $C_j = (C^t \cup \{u_i\}) \setminus \{c_j^t\}$. That is, if $u_i$ is currently not a representative (medoid), it is swapped with each of the $k$ current medoids. The objective function is evaluated as $M^{\omega}(C_j)$, for $j = 1, \ldots, k$, and if any of these values is less than the current $M^{\omega}(C^t)$, then the swap with the best improvement $M^{\omega}(C_{j_0})$ is accepted and $C^{t+1} = C_{j_0}$. In this case, or if $u_i \in C^t$, or if no $C_j$ improves $C^t$, the data item $u_i$ is placed at the end of the circular list. The turn passes to the next data item, and the algorithm halts when a pass through the circular list produces no swap.

The algorithm requires $O(kn^2)$ dissimilarity evaluations because evaluating $M^{\omega}(C)$ requires $O(n)$ distance evaluations and at least one pass through the circular list is made before halting.
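The interchange loop described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the `dist` and `weights` parameters and the use of a plain Python list in place of the circular list are choices made here, and the objective is recomputed from scratch for every candidate swap, which costs more distance evaluations than the bound quoted above.

```python
import random


def objective(medoids, items, dist, weights):
    """Non-crisp objective: each item contributes its distances to the
    medoids, sorted in non-decreasing order and weighted by `weights`."""
    total = 0.0
    for u in items:
        ranked = sorted(dist(u, c) for c in medoids)
        total += sum(w * d for w, d in zip(weights, ranked))
    return total


def non_crisp_tab(items, k, dist, weights, seed=0):
    """Hill-climbing interchange search over k-subsets of `items`."""
    rng = random.Random(seed)
    medoids = rng.sample(items, k)          # random initial set
    best = objective(medoids, items, dist, weights)
    swapped = True
    while swapped:                          # one iteration = one full pass
        swapped = False
        for u in items:
            if u in medoids:
                continue
            # Test the k sets obtained by swapping u with each medoid.
            candidates = [medoids[:j] + [u] + medoids[j + 1:]
                          for j in range(k)]
            value, j0 = min((objective(c, items, dist, weights), j)
                            for j, c in enumerate(candidates))
            if value < best:                # accept the best improving swap
                medoids, best, swapped = candidates[j0], value, True
    return medoids, best
```

With `weights = [1.0, 0.0, ...]` the sketch behaves like a crisp medoid search, while harmonic weights `[1, 1/2, ..., 1/k]` give the non-crisp variant. Each accepted swap strictly lowers the objective and there are finitely many subsets, so the loop terminates.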

Our randomized TaB also partitions $U$ into approximately $r = \sqrt{n}$ subsets $U_1, \ldots, U_r$, each of approximately $n/r \approx \sqrt{n}$ elements. The non-crisp TaB is applied to each of $U_1, \ldots, U_r$. Thus each $U_i$ is clustered into $k$ groups. This requires $O(kn\sqrt{n})$ distance calculations and results in $r$ sets of $k$ representatives. Then, all the resulting $rk$ representatives are placed in a circular list. Then non-crisp TaB is applied to this list, but the evaluation of $M^{\omega}(C)$ is performed with respect to the entire data set $U$. Since the circular list has length $k\sqrt{n}$, the iteration performed by the last execution of non-crisp TaB requires $O(k^2 n\sqrt{n})$ distance calculations. If having $k^2$ in the complexity is a problem, then we can simply choose $r = \sqrt{n}/k$, and obtain an algorithm with linear complexity in $k$ as well.

Clearly, this algorithm also converges. 



3 Case Study 



The literature contains many illustrations in commercial applications where clustering discovers the types of customers. Recent attention to WEB Usage Mining has concentrated on association rule extraction. There has been comparatively less success at categorizing WEB visitors than at categorizing customers in transactional data. This WEB usage mining task is to be achieved from the visitation data to a WEB-site. The goal is to identify strong correlation among users' interests by grouping their navigation paths. Paths are ordered sequences of WEB pages. Many applications can then benefit from the knowledge obtained. Discovery of visitor profiles is an important task for WEB-site design and evaluation. Other examples are WEB page suggestion for users in the same cluster, pre-fetching, personalization, collaborative filtering and user communities.

Paths are discrete structures. Several similarity measures have been defined, but they all correspond to dissimilarity between high-dimensional feature vectors extracted from the paths [16,27,28,32,33]. Because the length of the path, the order of the WEB pages, the time intervals between links and many other aspects play a role in dissimilarity measures, the resulting clustering problems are high dimensional. For example, a measure that has been used for WEB-path clustering is defined as follows. Let $P = \{p_1, \ldots, p_m\}$ be a set of pages, and let the corresponding usage-feature vector $\mathrm{USAGE}_{u_i}$ of user $u_i$ be defined by

$$\mathrm{USAGE}_{u_i}[j] = \begin{cases} 1 & \text{if } p_j \text{ is accessed by } u_i, \\ 0 & \text{otherwise.} \end{cases}$$

Then, $\mathrm{USAGE}(u_i, u_{i'}) = \mathrm{USAGE}_{u_i}^T \, \mathrm{USAGE}_{u_{i'}} / (\|\mathrm{USAGE}_{u_i}\| \, \|\mathrm{USAGE}_{u_{i'}}\|)$ is the Usage Similarity Measure (the cosine of the angle between the usage-feature vectors).
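As an illustration, the Usage Similarity Measure can be computed as below. This is a sketch under the assumption that the usage-feature vector is a binary indicator over the pages of the site; the function names and the choice to return 0 for an all-zero vector are conventions adopted here, not taken from the original.

```python
import math


def usage_vector(pages, visited):
    """Binary usage-feature vector: 1 if the user accessed the page, else 0
    (assumed encoding)."""
    return [1.0 if p in visited else 0.0 for p in pages]


def usage_similarity(x, y):
    """Cosine of the angle between two usage-feature vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    norm = math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y))
    # Convention chosen here: an empty visit set has similarity 0.
    return dot / norm if norm > 0 else 0.0


pages = ["a", "b", "c", "d"]
u1 = usage_vector(pages, {"a", "b"})
u2 = usage_vector(pages, {"b", "c"})
similarity = usage_similarity(u1, u2)   # one shared page out of two each
```

Because the measure is a cosine, rescaling either vector by a positive constant leaves the similarity unchanged.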

Clearly the dimension of the vectors involved is the number m of pages in the 
WEB-site, typically a number higher than 10 and usually much larger. Moreover, 
the Usage Similarity Measure is the simplest of the dissimilarity measures since 
it does not consider order, length or time along a visitation path. Other more 
robust dissimilarity measures are more costly to evaluate and imply feature 
vectors in even higher dimensions. 

Also, the discrete nature of paths removes vector-based operations, like averages (means). Thus, while it is possible to compute the average of two feature vectors, like $(\mathrm{USAGE}_{u_i} + \mathrm{USAGE}_{u_{i'}})/2$, it is not clear that the result is the feature vector of a path (a path with such a feature vector may actually be infeasible given the links between pages). Also, spaces defined by dissimilarity measures are different from Euclidean spaces, since for all feature vectors $v$, the similarity between $v$ and itself is maximum (1); however, it is also maximum between $v$ and any scalar transformation $\lambda v$ of the vector itself, for all constants $\lambda > 0$.

These two challenges obstruct the use of many clustering algorithms for finding groups of WEB visitors based on their visitation paths (including k-Means and fuzzy-c-Means). The algorithms proposed to date [7,16,27,28,32] are all of quadratic time complexity (they essentially require computing the dissimilarity between all pairs of paths). These clustering efforts, although not scalable, have demonstrated the extensive benefits and sophisticated applications emerging from identifying groups of visitors to a WEB-site.

The implementation in $O(n\sqrt{n})$ time results in a dramatic improvement in CPU-time resources. Our implementation is much faster than previous matrix-based algorithms. Just for the Usage dissimilarity metric, the Matrix-Based algorithm requires over 18,000 CPU seconds (5 hrs!) with 590 users, while our algorithm in crisp mode requires only 83 seconds (just over a minute) and in harmonic mode it requires 961 seconds (16 minutes). These results are on the same data set of logs identified with visitors and sessions used by Xiao et al. Namely, we used the WEB-log data sets publicly available from the Boston University Computer Science Department.

To evaluate the quality of the clustering, synthetic data sets are useful because
it is possible to compare the results of the algorithms with the true clustering.
Typically, synthetic data is generated from a mixture or from a set of k represen- 
tatives by perturbing each slightly. The quality of the clustering is reflected by 
the proportion in which the clustering algorithm retrieves groups and identifies 
data items to their original group. 

We reproduced to the best of our ability the synthetic generation suggested for the same task of clustering paths used by Shahabi et al. Our crisp version has already been shown to provide much better results than matrix-based alternatives for the usage dissimilarity measure. In what follows, we discuss results comparing our harmonic EM type algorithm ($\omega^T = (1, 1/2, \ldots, 1/k)$) and its crisp EM type version ($\omega^T = e_1^T$). We used the usage-dissimilarity measure and two other measures, the frequency measure and the order measure. The order measure is the same as the path measure. Because our algorithms start with a random set of representatives, each was run 5 times and 95% confidence intervals are reported. There are several issues in our results on which we would like to comment. The first is that we were surprised that what the literature has claimed on the issue of features from paths towards dissimilarity measures is not reflected in our results. In fact, the more sophisticated the dissimilarity measure, the poorer the results. Quality was much more affected by the dissimilarity measure used than by the size of the data set or the clustering algorithm used. Our non-crisp algorithm does better, but we admit that for this data set the results are not dramatic. In fact, they are probably less impressive if one considers the CPU-time requirements [14]. The harmonic version is slower. It requires a factor of $O(k)$ more space and $O(k \log k)$ administrative work. However, the observed CPU times confirm the $O(n \log n)$ nature of our algorithms and that dissimilarity evaluation is the main cost. In fact, the usage and frequency dissimilarity functions can be computed in $O(p)$, where $p$ is the average length of paths. However, the order dissimilarity function requires $\Omega(p^2)$ time to be computed.

We note that the confidence intervals for the harmonic version are smaller than for the crisp version. This indicates that the harmonic version is more robust to initialization. To explore this issue further, we also performed some initial experiments regarding the robustness to initialization of our algorithms. The experiment consisted of evaluating the discrepancy in clustering results between independent executions (with a different random seed) of the same algorithm. In our results there is much less variance with the harmonic version. Given that the initialization is random, this suggests the algorithm finds high quality local optima with respect to its loss function. However, more thorough experimentation is required to confirm this suggestion.

We also point out that, in parallel to k-Means, Expectation Maximization, and k-Harmonic Means, our harmonic EM type algorithm may place two representatives very close together, attracted by the same peak in frequency. However, the theoretical foundation provided here also allows this to be detected, so that the “boosting” techniques suggested for k-Harmonic Means can be applied.

4 Final Remarks 

We presented the theoretical framework for sub-quadratic clustering of paths with non-crisp classification. The experiments are not exhaustive, but they illustrate that there are benefits to be obtained with respect to quality with non-crisp classification. Moreover, they also reflect that there are trade-offs to be investigated between the complexity of the dissimilarity function and its computational requirements. We indicated the possibility of hybridization between the discrete hill-climber type and the EM type. However, because our EM methods are generic on the vector $\omega$, they offer a range of diversity for the computational requirements in this regard. For example, one can imagine a $k'$-nearest neighbor classification with $k' < k$. The vector $\omega$ would have entries equal to zero from the $(k'+1)$-th entry onwards. The nonzero entries can then be a harmonic average of the $k'$ nearest neighbors or some other combination. Thus, this is a classification that incorporates the supervised-learning technique of nearest neighbors and reduces smoothly to crisp classification as $k'$ gets closer to 1.

We have illustrated the new clustering methods with similarity measures of interest between WEB visitors. Similarity measures proposed for the analysis of WEB visitation [16,27,28,32,33] pose a high-dimensional non-Euclidean clustering problem. This eliminates many clustering methods. Previously we (and others) provided a more detailed discussion of why density-based or hierarchical methods are unsuitable for clustering paths. Our methods here are fast and robust. They are applicable to any similarity measure and can dynamically track users with high efficiency. Moreover, we have generalized fuzzy-membership or probabilistic membership to non-crisp classification.

Abstract. In this paper, we propose some new tools to allow machine
learning classifiers to cope with time series data. We first argue that 
many time-series classification problems can be solved by detecting and 
combining local properties or patterns in time series. Then, a technique 
is proposed to find patterns which are useful for classification. These 
patterns are combined to build interpretable classification rules. Exper- 
iments, carried out on several artificial and real problems, highlight the 
interest of the approach both in terms of interpretability and accuracy 
of the induced classifiers. 



1 Introduction 

Nowadays, machine learning algorithms are becoming very mature and well un- 
derstood. Unfortunately, most of the existing algorithms (if not all) are dedicated 
to simple data (numerical or symbolic) and are not adapted to exploit relation- 
ships among attributes (such as for example geometric or temporal structures). 
Yet, a lot of problems would be solved easily if we could take into account such 
relationships, e.g. temporal signals classification. While a laborious application 
of existing techniques will give satisfying results in many cases (our experiments 
confirm this), we believe that a lot of improvement could be gained by design- 
ing specific algorithms, at least in terms of interpretability and simplicity of the 
model, but probably also in terms of accuracy. 

In this paper, the problem of time signals classification is tackled by means 
of the extraction of discriminative patterns from temporal signals. It is assumed 
that classification could be done by combining in a more or less complex way 
such patterns. For the aim of interpretability, decision trees are used as pat- 
tern combiners. The first section formally defines the problem of multivariate 
time-series classification and gives some examples of problems. Related work in 
the machine learning community is discussed here as well. In the next section, 
we experiment with two “naive” sets of features, often used as first means to 
handle time series with traditional algorithms. The third section is devoted to 
the description of our algorithm which is based on pattern extraction. The last 
section presents experiments with this method. Finally, we conclude with some
comments and future work directions.

2 The (Multivariate) Time-Series Classification Problem

2.1 Definition of the Problem 

The time series classification problem is defined by the following elements: 

— A universe $U$ of objects representing dynamic system trajectories or scenarios. Each object¹ $o$ is observed for some finite period of time $[0, t_f(o)]$².

— Objects are described by a certain number of temporal candidate attributes, which are functions of object and time, thus defined on $U \times [0, +\infty[$. We denote by $a(o, t)$ the value of the attribute $a$ at time $t$ for the object $o$.

— Each object is furthermore classified into one class, $c(o) \in \{c_1, \ldots\}$.

Given a random sample LS of objects from the universe, the goal of the machine learning algorithm is to find a function $f(o)$ which is as close as possible to the true classification $c(o)$. This function should depend only on attribute values (not on the object), i.e. $f(o) = f(\mathbf{a}(o, \cdot))$, where $\mathbf{a}$ denotes the vector of attributes. The classification also should not depend on absolute time values. A consequence of this latter property is that the model should be able to classify every scenario whatever its duration of observation.

Note that alternative problems should also be addressed. For example, a temporal detection problem has been defined in earlier work, where the goal is to find a function $f(o, t)$ of object and time which can detect as soon as possible scenarios of a given class from past and present attribute values only.

Temporal attributes have been defined here as continuous functions of time. However, in practice, signals need to be sampled for representation in computer memory. So, in fact, each scenario is described by the following sequence of vectors: $(\mathbf{a}(o, t_0(o)), \mathbf{a}(o, t_1(o)), \ldots, \mathbf{a}(o, t_{n(o)}(o)))$, where $t_i(o) = i \cdot \Delta t(o)$ and $i = 0, 1, \ldots, n(o)$. The number of time samples, $n(o)$, may be object dependent.

2.2 Description of Some Problems 

The possible application domains for time series classification are numerous: 
speech recognition, medical signal analysis, recognition of gestures, intrusion 
detection... In spite of this, it is difficult to find datasets which can be used to 
validate new methods. In this paper, we use three problems for evaluation. The 
first two datasets are artificial problems used by several researchers in the same 
context. The last one is a real problem of speech recognition. 

Control Chart (CC). This dataset was proposed to validate clustering techniques and has also been used for classification. Objects are described by one temporal attribute and classified into one of six possible classes (see [2] for more details). The dataset we use was obtained from the UCI KDD Archive [3] and contains 100 objects of each class. Each time series is defined by 60 time points.

¹ In what follows, we will use the terms scenario and object interchangeably to denote an element of $U$.

² Without loss of generality, we assume that the start time of a scenario is always 0.

Cylinder-Bell-Funnel (CBF). This problem has been used in several earlier studies for validation. The goal is to separate three classes of objects: cylinder (c), bell (b) and funnel (f). Each object is described by one temporal attribute given by:

$$a(o, t) = \begin{cases} (6 + \eta) \cdot \chi_{[a,b]}(t) + \epsilon(t) & \text{if } c(o) = c, \\ (6 + \eta) \cdot \chi_{[a,b]}(t) \cdot (t - a)/(b - a) + \epsilon(t) & \text{if } c(o) = b, \\ (6 + \eta) \cdot \chi_{[a,b]}(t) \cdot (b - t)/(b - a) + \epsilon(t) & \text{if } c(o) = f, \end{cases}$$

where $t \in [1, 128]$ and $\chi_{[a,b]}(t) = 1$ if $a \le t \le b$, 0 otherwise.

In the original problem, $\eta$ and $\epsilon(t)$ are drawn from a standard normal distribution $N(0, 1)$, $a$ is an integer drawn uniformly from [16, 32] and $b - a$ is an integer drawn uniformly from [32, 96]. Figure 1 shows an example of each class. As in earlier work, we generate 266 examples for each class using a time step of 1 (i.e. 128 time points per object).
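The generation procedure above can be sketched in a few lines. This is an illustrative reading of the formula, not the generator used for the experiments: the function name `cbf_series`, the single-letter labels and the explicit random generator argument are assumptions made here.

```python
import random


def cbf_series(label, rng, length=128):
    """One Cylinder-Bell-Funnel series; label is 'c', 'b' or 'f'."""
    a = rng.randint(16, 32)             # start of the event
    b = a + rng.randint(32, 96)         # end of the event, b - a in [32, 96]
    eta = rng.gauss(0.0, 1.0)           # amplitude perturbation
    series = []
    for t in range(1, length + 1):
        inside = 1.0 if a <= t <= b else 0.0
        if label == "c":
            shape = 1.0                 # cylinder: flat plateau
        elif label == "b":
            shape = (t - a) / (b - a)   # bell: linear increase
        elif label == "f":
            shape = (b - t) / (b - a)   # funnel: linear decrease
        else:
            raise ValueError(label)
        series.append((6.0 + eta) * inside * shape + rng.gauss(0.0, 1.0))
    return series


rng = random.Random(0)
dataset = [(label, cbf_series(label, rng)) for label in "cbf" for _ in range(266)]
```

Changing the two `randint` ranges to (0, 64) and (32, 64) would give a variant with more pronounced temporal shifting of the event.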




Fig. 1. An example of each class from the CBF dataset 

This dataset attempts to catch some typical properties of the time domain. Hence, the start time of events, $a$, is significantly randomized from one object to another. As we will argue later that our algorithm can cope with such temporal shifting of events, we generate another version of the same dataset by further emphasizing the temporal shifting. This time, $a$ is drawn from [0, 64] and $b - a$ is drawn from [32, 64]. We will call this dataset CBF-tr (for “CBF translated”).

Japanese Vowels (JV). This dataset is also available in the UCI KDD Archive and was built by Kudo et al. to validate their multidimensional curve classification system. The dataset records 640 time series corresponding to the successive utterance of two Japanese vowels by nine male speakers. Each object is described by 12 temporal attributes corresponding to 12 LPC spectrum coefficients. Each signal is represented in memory by between 7 and 29 time points. The goal of machine learning is to identify the correct speaker from this description.

2.3 Related Work in Machine Learning 

Several machine learning approaches have been developed recently to solve the time series classification problem. Manganaris, for example, constructs piecewise polynomial models for univariate signals and then extracts features from this representation for classification. Kadous extracts parameterized events from signals. These events are clustered in the parameter space and the resulting prototypes are used as a basis for creating classifiers. Kudo et al. transform multivariate signals into binary vectors. Each element of this vector corresponds to one rectangular region of the value-time space and tells if the signal passes through this region. A method of their own, subclass, builds rules from these binary vectors. Gonzales et al. [2] extend (boosted) ILP systems with predicates that are suited for the task of time series classification.

All these approaches share some common characteristics. First, the authors are all more interested in getting interpretable rules than in accuracy. We will give a justification for that in the next section. Second, they use some discretization techniques to reduce the search space for rules (from simple discretization to piecewise modeling or clustering). All of them extract rules which depend on absolute time values. This makes it difficult to detect properties which may occur at variable time positions and can be a serious limitation when solving some problems (for example the CBF-tr problem). The technique we propose does not have this drawback.



3 Experiments with Naive Sampling and Classical 
Methods 

Before starting with dedicated approaches for time series classification, it is 
interesting to see what can be done with classical machine learning algorithms 
and a naive approach to feature selection. To this end, experiments were carried 
out with two very simple sets of features: 

— Sampled values of the time series at equally spaced instants. To handle time series of different durations, time instants are taken relative to the duration of the scenario. More precisely, if $n$ is the number of time instants to take into account, each temporal attribute $a$ gives rise to $n$ scalar attributes given by: $a(o, t_f(o) \cdot \frac{i}{n-1})$, $i = 0, 1, \ldots, n - 1$.

— Segmentation (also proposed in earlier work). The time axis of each scenario is divided into $n$ equal-length segments and the average value of each temporal attribute along each of these segments is taken as an attribute.

The two approaches give $n \cdot m$ scalar attributes from $m$ temporal ones. Note that while the first approach is fully independent of time, the second one takes into account the temporal ordering of values to compute their average and so is doing some noise filtering.
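The two feature sets can be sketched for a single sampled attribute as follows. The function names, the nearest-index rounding and the handling of `n == 1` are assumptions made for this illustration; `segment_features` also assumes that `n` does not exceed the number of time points.

```python
def sampled_features(series, n):
    """n values taken at equally spaced positions relative to the duration."""
    last = len(series) - 1
    if n == 1:
        return [series[0]]              # convention chosen for this sketch
    return [series[round(i * last / (n - 1))] for i in range(n)]


def segment_features(series, n):
    """Mean of the series over n (nearly) equal-length consecutive segments."""
    size = len(series)
    bounds = [i * size // n for i in range(n + 1)]
    return [sum(series[lo:hi]) / (hi - lo)
            for lo, hi in zip(bounds, bounds[1:])]


series = list(range(10))                # toy signal 0, 1, ..., 9
samples = sampled_features(series, 4)
segments = segment_features(series, 5)
```

Because both functions index relative to the length of the series, they return `n` attributes whatever the duration of the scenario, which is what allows a fixed-size input for a standard learner.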

These two sets of attributes have been tried on the three problems described in Sect. 2.2 as inputs to three different learning algorithms: decision trees, decision tree boosting and the one-nearest neighbor. Results in terms of error rates are summarized in Table 1 for increasing values of $n$. They were obtained by ten-fold cross-validation for the first three problems and by validation on an independent test set of size 370 for the last one³ (JV). The best result is boldfaced in every row.



Table 1. Results with simple sampling methods (error rates in %)

CC          Number of steps
Sampling    3              5              10             30             60
DT          31.67 ± 4.94   17.00 ± 4.40   11.67 ± 4.08   8.17 ± 3.53    6.83 ± 2.29
Boosting    24.33 ± 4.67   12.33 ± 3.96   5.17 ± 3.20    2.00 ± 1.63    1.50 ± 1.17
1-NN        31.00 ± 4.72   17.00 ± 3.14   8.66 ± 3.78    1.50 ± 1.38    1.83 ± 1.38
Segment     3              5              10             30             60
DT          12.33 ± 4.73   16.33 ± 4.46   3.50 ± 2.52    7.50 ± 2.50    6.83 ± 2.29
Boosting    6.67 ± 4.01    11.83 ± 5.45   1.50 ± 1.38    2.00 ± 2.08    1.50 ± 1.17
1-NN        8.16 ± 4.24    12.00 ± 4.46   0.50 ± 1.07    1.33 ± 1.00    1.83 ± 1.38

CBF         Number of steps
Sampling    8              16             32             64             128
DT          9.33 ± 3.35    7.83 ± 3.42    7.50 ± 2.61    7.33 ± 2.49    9.83 ± 3.83
Boosting    6.50 ± 3.02    4.00 ± 2.00    2.33 ± 1.70    2.17 ± 1.83    3.50 ± 2.17
1-NN        7.66 ± 4.16    3.83 ± 2.24    2.00 ± 1.63    1.33 ± 1.63    1.16 ± 1.30
Segment     8              16             32             64             128
DT          4.67 ± 2.45    2.67 ± 1.33    4.33 ± 2.38    7.67 ± 4.29    9.83 ± 3.83
Boosting    3.17 ± 1.57    0.67 ± 1.11    1.67 ± 1.83    2.17 ± 1.98    3.50 ± 2.17
1-NN        2.33 ± 2.00    0.50 ± 0.76    0.50 ± 1.07    1.00 ± 1.10    1.16 ± 1.30

CBF-tr      Number of steps
Sampling    8              16             32             64             128
DT          19.17 ± 3.18   23.50 ± 6.81   20.67 ± 3.82   21.67 ± 3.80   23.83 ± 6.95
Boosting    14.17 ± 3.52   10.50 ± 4.54   6.67 ± 2.79    5.00 ± 2.36    7.17 ± 3.58
1-NN        19.33 ± 3.89   8.00 ± 4.70    5.33 ± 2.56    3.50 ± 2.03    3.83 ± 2.89
Segment     8              16             32             64             128
DT          14.17 ± 5.34   14.17 ± 4.55   12.83 ± 5.58   12.67 ± 3.82   23.83 ± 6.95
Boosting    12.67 ± 5.23   6.00 ± 3.27    4.17 ± 1.86    5.33 ± 2.08    7.17 ± 3.58
1-NN        12.00 ± 2.33   3.00 ± 2.66    1.66 ± 1.82    2.66 ± 1.52    3.83 ± 2.89

JV          Number of steps
Sampling    2              3              5              7
DT          14.86          14.59          19.46          21.08
Boosting    6.76           5.14           5.68           6.22
1-NN        3.24           3.24           3.24           3.78
Segment     1              2              3              4              5
DT          18.11          17.30          12.97          19.46          17.03
Boosting    9.46           7.84           6.76           6.76           6.22
1-NN        6.49           3.51           3.51           3.78           4.05

³ This division was suggested by the donors of this dataset.

There are several things to say about this experiment. There exists an optimal value of $n$ which corresponds to the best tradeoff between bias and variance for each problem⁴. This optimal value could be automatically fixed by cross-validation. Segmentation rather than simple sampling is highly beneficial on all datasets except for the last one. The best error rates we get with this simple approach are very good with respect to previously published results on these problems (i.e. with dedicated temporal approaches). The best method is 1-NN on all problems. Decision trees do not work well while boosting is very effective. As boosting works mainly by reducing the variance of a classifier, the bad results of decision trees may be attributed to a high variance. Furthermore, their interpretability is also questionable because of the choice of attributes. Indeed, how should one understand, for example, a rule like “if a(o, 32) < 2.549 and a(o, 22) < 3.48 then c(o) = bell”, which was induced from the CBF dataset? Although very simple and accurate, this rule does not make obvious the temporal increase of the signal peculiar to the bell class (see Fig. 1).

One conclusion of this experiment is that very good results can be obtained
with simple feature sets but by sacrificing interpretability. This observation jus-
tifies the fact that most of the machine learning research on temporal data has
focused on interpretability rather than on accuracy.



4 Pattern Extraction Technique 

Why are the approaches adopted in the previous section not very appropriate? First, even if the learning algorithm gives interpretable results, the model will not be comprehensible anymore in terms of the temporal behavior of the system. Second, some very simple and common temporal features are not easily represented as a function of such attributes. For example, consider a set of sequences of $n$ random numbers in [0, 1], $\{o_1, o_2, \ldots, o_n\}$, and classify a sequence into the class $c_1$ if three consecutive numbers greater than 0.5 can be found in the sequence, whatever the position. With a logical rule inducer (like decision trees) and using $o_1, o_2, \ldots, o_n$ as input attributes, a way to represent such a classification is the following rule: if ($o_1 > 0.5$ and $o_2 > 0.5$ and $o_3 > 0.5$) or ($o_2 > 0.5$ and $o_3 > 0.5$ and $o_4 > 0.5$) or ... or ($o_{n-2} > 0.5$ and $o_{n-1} > 0.5$ and $o_n > 0.5$) then return class $c_1$. Although the initial classification rule is very simple, the induced rule has to be very complex. This representation difficulty will result in a high variance of the resulting classifier and thus in poor accuracy. The use of variance reduction techniques like boosting and bagging often will not be enough to restore the accuracy and anyway will destroy interpretability.
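The contrast between the simple concept and its position-by-position rule can be made concrete with a short sketch. Both function names are illustrative assumptions; the first scans for the local property directly, the second spells out the disjunction over all $n - 2$ start positions that a rule learner working on $o_1, \ldots, o_n$ would have to represent.

```python
def has_run_above(sequence, threshold=0.5, run_length=3):
    """True if `run_length` consecutive values exceed `threshold` anywhere."""
    current = 0
    for value in sequence:
        current = current + 1 if value > threshold else 0
        if current >= run_length:
            return True
    return False


def dnf_rule(sequence, threshold=0.5, run_length=3):
    """The same concept written as an OR over every possible start position."""
    n = len(sequence)
    return any(all(sequence[i + j] > threshold for j in range(run_length))
               for i in range(n - run_length + 1))


positive = [0.1, 0.6, 0.7, 0.8, 0.2]
negative = [0.6, 0.7, 0.2, 0.8, 0.9]
```

The two functions agree on every input, but only the first expresses the property independently of where it occurs in the sequence.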

In this paper, we propose to extend classifiers by allowing them to detect local shift-invariant properties or patterns in time series (like the one used to define the class $c_1$ in our example). The underlying hypothesis is that it is possible to classify a scenario by combining such pattern detections in a more or less complex way. In what follows, we first define what patterns are and how to construct binary classification tests from them. Then, we propose to combine these binary tests into decision trees. In this context, piecewise constant modeling is proposed to reduce the search space for candidate patterns.

⁴ The bias/variance dilemma is explained comprehensively in the machine learning literature.

Pattern Definition. A possible way to define a pattern is to use a limited 
support reference signal and then say that the pattern is detected at a particular 
position of a test signal if the distance between the reference signal and the test 
signal at this position is less than a given threshold. In other words, denoting 
by p{.) a signal defined on the interval [0,tp] and by a(.) a signal defined on the 
interval [0,ta] with ta > tp, we would say that the pattern associated to p{.) is 
detected in a{.) at time t' {tp < t' < ta) if: 



where dp is the minimal allowed distance to the pattern (euclidian distance). A 
binary classification rule may be constructed from this pattern by means of the
following test:

$$T(o) = \text{True} \iff \exists t' \in [t_p, t_f(o)] : d(p(.), a(o,.), t') < d_p \qquad (2)$$

$$\iff \Big[ \min_{t' \in [t_p, t_f(o)]} d(p(.), a(o,.), t') \Big] < d_p \qquad (3)$$

where a is a temporal attribute.

Integration with Decision Trees. As we are mainly interested in interpretable
classifiers, the way we propose to combine these binary tests is to let them
appear as candidate tests during decision tree induction. Each step of decision
tree induction consists of evaluating a set of candidate tests and choosing the
one which yields the best score to split the node (see m for more details).
In the present context, candidate tests are all possible triplets (a, p, d_p)
where a is a temporal candidate attribute.

Once we have chosen an attribute a and a pattern p, the value of d_p which
realizes the best score may be computed in the same way as the optimal
discretization threshold for a numerical attribute. Indeed, test (2) is
equivalent to a test on the new numerical attribute
a_n(o) = min_{t'} d(p(.), a(o, .), t').

The number of candidate patterns p(.) could a priori be huge, so it is
necessary to reduce the search space for candidate patterns. A first idea is to
construct patterns from subsignals extracted from the signals appearing in the
learning set (each one corresponding to temporal attributes of objects).
However, it is still impossible in practice to consider every such subsignal as
a candidate pattern. Even assuming a discrete time representation for the
datasets, this step remains prohibitive (e.g. there are 8256 different subseries
in a time series of 128 points). Also, patterns extracted from raw signals may
be too complex or too noisy for interpretation. The solution adopted here is to
first represent the signal by some piecewise model and then use the
discontinuity points of this model to define interesting patterns. By choosing
the complexity of the model (the number of time axis pieces), we are thus able
to fix the number of patterns to consider.
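For discretely sampled signals, the test (2)-(3) can be sketched in Python as follows. This is only an illustration under stated assumptions: the distance d is taken to be the mean squared difference between the pattern and a window of the signal of the same length, and the function names window_distance, min_pattern_distance and pattern_test are hypothetical.

```python
def window_distance(pattern, signal, end):
    """Mean squared difference between `pattern` and the window of `signal`
    that ends at index `end` (exclusive). Assumed form of the distance d."""
    start = end - len(pattern)
    window = signal[start:end]
    return sum((p - s) ** 2 for p, s in zip(pattern, window)) / len(pattern)


def min_pattern_distance(pattern, signal):
    """Minimum of the distance over all window positions, as in (3)."""
    if not pattern or len(pattern) > len(signal):
        raise ValueError("pattern must be non-empty and no longer than the signal")
    return min(
        window_distance(pattern, signal, end)
        for end in range(len(pattern), len(signal) + 1)
    )


def pattern_test(pattern, signal, d_p):
    """Binary test T(o): true if the pattern occurs somewhere within distance d_p."""
    return min_pattern_distance(pattern, signal) < d_p
```

Because the test only depends on the minimum distance, that minimum can be stored once per object and treated as an ordinary numerical attribute when the threshold d_p is tuned.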
Table 2. Discretization of time signals by regression trees

Let us denote by mean[t1,t2](a(.)) and var[t1,t2](a(.)) respectively the mean and the
variance of the signal a(.) on the interval [t1, t2], with

$$\mathrm{var}_{[t_1,t_2]}(a(.)) = \frac{1}{t_2 - t_1} \int_{t_1}^{t_2} \big(a(t) - \mathrm{mean}_{[t_1,t_2]}(a(.))\big)^2 \, dt$$

To discretize a(.) on [t1, t2] with Nmax pieces:

1. set D = {t1, t2}, the set of discontinuity points; set L = {[t1, t2]}, the set of
   intervals; set â(t) = mean[t1,t2](a(.)) on [t1, t2], the model for a(.);

2. set Np = 1, the current number of time segments (pieces);

3. if Np = Nmax then stop and return â(.);

4. find [ti, tj] in L such that (tj - ti) · var[ti,tj](a(.)) is maximal (best first strategy);

5. remove [ti, tj] from L;

6. find t* in [ti, tj] which maximizes the variance reduction:

$$\Delta\mathrm{var}(t^*) = (t_j - t_i)\,\mathrm{var}_{[t_i,t_j]}(a(.)) - (t^* - t_i)\,\mathrm{var}_{[t_i,t^*]}(a(.)) - (t_j - t^*)\,\mathrm{var}_{[t^*,t_j]}(a(.))$$

7. set â(t) = mean[ti,t*](a(.)) on [ti, t*] and â(t) = mean[t*,tj](a(.)) on [t*, tj];

8. Np = Np + 1; add [ti, t*] and [t*, tj] to L; add t* to D;

9. go to step 3.

Regression Tree Modeling. In this paper, a simple piecewise constant model
is computed for each signal. Regression trees are used to build this model
recursively. The exact algorithm is described in Table 2. It follows a best first
strategy for the expansion of the tree, and the number of segments (or terminal
nodes) is fixed in advance. The discretization of an example signal in 5 segments
is reproduced in the left part of Fig. 2.

From a discretized signal â(.), the set of candidate signals p(.) is defined as
follows:

$$P = \{\, p(.) \text{ on } [0, t_j - t_i] \mid t_i, t_j \in D,\ t_i < t_j,\ p(t) = \hat{a}(t_i + t) \,\}$$




Fig. 2. Left, the regression tree modeling of a signal with 5 intervals. Right, the detec- 
tion of a pattern extracted from this signal in another signal of the same class 
where D is the set of discontinuity points defined in Table 2. The size of this set
is n(n + 1)/2 if n is the number of segments. The right part of Fig. 2 shows
a pattern extracted from the left signal and its minimal distance position in
another signal.
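The discretization procedure and the construction of the pattern set can be sketched for a uniformly sampled signal as follows. This is a minimal illustration rather than the authors' code: time is assumed to be discrete, so that (tj - ti) times the variance becomes a sum of squared deviations, and the names sse, best_split, discretize, model and candidate_patterns are hypothetical.

```python
def sse(values):
    """Sum of squared deviations from the mean (segment length times variance)."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def best_split(signal, lo, hi):
    """Return (variance reduction, cut index) of the best cut inside [lo, hi)."""
    total = sse(signal[lo:hi])
    best = None
    for cut in range(lo + 1, hi):
        gain = total - sse(signal[lo:cut]) - sse(signal[cut:hi])
        if best is None or gain > best[0]:
            best = (gain, cut)
    return best


def discretize(signal, n_max):
    """Best-first splitting until n_max pieces; returns the sorted set D of
    discontinuity indices, including both end points."""
    segments = [(0, len(signal))]
    while len(segments) < n_max:
        splittable = [s for s in segments if s[1] - s[0] > 1]
        if not splittable:
            break
        # Expand the segment with the largest length-weighted variance first.
        lo, hi = max(splittable, key=lambda s: sse(signal[s[0]:s[1]]))
        _, cut = best_split(signal, lo, hi)
        segments.remove((lo, hi))
        segments += [(lo, cut), (cut, hi)]
    return sorted({0, len(signal)} | {lo for lo, _ in segments})


def model(signal, points):
    """Piecewise constant model: each segment is replaced by its mean."""
    out = []
    for lo, hi in zip(points, points[1:]):
        out += [sum(signal[lo:hi]) / (hi - lo)] * (hi - lo)
    return out


def candidate_patterns(signal, n_max):
    """All pieces of the model lying between two discontinuity points."""
    points = discretize(signal, n_max)
    approx = model(signal, points)
    return [approx[i:j] for a, i in enumerate(points) for j in points[a + 1:]]
```

With n segments there are n + 1 discontinuity points, hence n(n + 1)/2 ordered pairs and as many candidate patterns; for a signal with three flat levels and n_max = 3 the sketch yields 6 patterns.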



Node Splitting for Tree Growing. Candidate signals p(.) during node
splitting are thus extracted from the piecewise constant modeling of learning set
time series. Unfortunately, even with this discretization/segmentation, it is
intractable to consider every subsignal in the learning set, especially when the
learning set is large. A simple solution to overcome this difficulty is to randomly
sample a subset of the scenarios from the learning set as references for defining
the subsequences. In our experiments, one scenario is drawn from each class.
This further simplification should not be limiting, because interesting patterns
are patterns typical of one class and these patterns (if they exist) will
presumably appear in every scenario of the class. Our final search algorithm
for candidate tests when splitting decision tree nodes is given in Table 3.



Table 3. Search algorithm for candidate tests during tree growing

For each temporal attribute a, and for each class c:

— select an object o of class c from the current learning set,

— discretize the signal a(o, .) to obtain â(o, .),

— compute the set P of subsignals p(.) from â(o, .),

— for each signal p(.) in P:

    — compute the optimal threshold d_p,

    — if the score of this test is greater than the best score so far, retain the
      triplet (a(.), p(.), d_p) as the best current test.
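The step "compute the optimal threshold" can be sketched as a standard threshold search over the minimal distances of the objects in the node. The score used for node splitting is not specified in this section, so the sketch below assumes, purely for illustration, the reduction of Gini impurity; the names gini and best_threshold are hypothetical.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels (assumed score, see lead-in)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def best_threshold(values, labels):
    """Return (score, threshold) of the best test `value < threshold`.

    `values` holds, for each object of the node, the minimal distance
    between the pattern and the object's signal."""
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    sorted_labels = [label for _, label in pairs]
    base = gini(sorted_labels)
    best = (0.0, None)
    for i in range(1, n):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no threshold can separate two equal values
        left, right = sorted_labels[:i], sorted_labels[i:]
        score = base - (len(left) * gini(left) + len(right) * gini(right)) / n
        if score > best[0]:
            best = (score, (pairs[i - 1][0] + pairs[i][0]) / 2)
    return best
```

The outer loops of Table 3 would call such a function once per candidate pattern and keep the triplet with the highest score.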



5 Experiments 

We first experiment with the pattern extraction technique described in the
previous section. Then, since a byproduct of regression tree modeling is a
reduction of the space needed to store the learning set, we further test its
combination with the nearest neighbor algorithm.



5.1 Decision Tree with Pattern Extraction 

Experiments have been carried out in exactly the same test conditions as in 
Sect. 0 For regression tree modeling, increasing values of the number of time 
segments, Nmax, were used (11 only for CC). Results are summarized in Table 4
and commented below.
Table 4. Results of decision tree with patterns

                         Number of pieces
DB        3              5              7              11
CC        2.33 ± 1.70    3.17 ± 2.03    3.33 ± 2.58    3.00 ± 1.63
CBF       4.00 ± 2.71    2.00 ± 1.45    1.17 ± 1.67    -
CBF-tr    9.33 ± 5.68    3.83 ± 2.24    2.33 ± 1.33    -
JV        22.97          21.62          19.4           -



Accuracy. On the first three datasets, the new method gives significant
improvements with respect to decision tree (compare with Table 0). As expected,
the gain in accuracy is especially impressive on the CBF-tr problem (from 12.67 
to 2.33). This problem is also the only one where our temporal approach is better 
than boosting with simple features. On JV, our approach does not improve ac- 
curacy with respect to naive sampling. Several explanations are possible. First, 
this is the only dataset with more than one attribute and our method is not able 
to capture properties distributed on several signals. Second, there are 9 classes 
and only 270 examples in this problem and the recursive partitioning of decision 
tree is known to suffer in such conditions. Third, it seems also that the temporal 
behavior is not very important in this problem, as 1-NN with only two values 
(the start and end values of each attribute) gives the best results (3.24 %). 

From this experiment, the optimal number of segments appears to be prob- 
lem dependent. In practice, we would thus need a way to tune this parameter. 
Besides cross-validation, various methods have been proposed to fix the number 
of segments for piecewise modeling (for example, the MDL principle in [I Iji. 
In our case, we could also take advantage of the pruning techniques (or
stop-splitting criteria) in the context of regression tree induction. We have yet
to experiment with these methods.

Interpretability. By construction, the rules produced by our algorithm are very 
readable. For example, a decision tree induced from 500 examples of the CBF 
problem gives the very simple rules described visually in Fig. 3. The extracted
patterns are confirmed by the definition of the problem. This decision tree gives 
an error rate of 1.3% on the 298 remaining examples. For comparison, a decision 
tree built from the mean values on 16 segments (the features set which yields 
the best result in Table 0) contains 17 tests and gives an error rate of 4.6%. 




[Figure: the extracted patterns p1, p2 and p3, together with the following rules]

if p1 then funnel
else if p2 and p3 then bell
else if p2 and not p3 then funnel
otherwise cylinder

Fig. 3. Classification rules for the CBF problem
5.2 Regression Tree Modeling with 1-NN

As the discretization by regression trees yields a compact version of the original
time series, it is interesting to combine it with a nearest neighbor classifier.
The main advantage is a reduction of the space needed to store the
learning set. The algorithm proceeds as follows. First, all signals in the learning
set are discretized by regression trees using the same maximum number of pieces.
Then, the distance between a test object o and an object o' of the learning set
is defined by:

$$d(o, o') = \sum_i \frac{1}{\min(t_f(o), t_f(o'))} \int_0^{\min(t_f(o), t_f(o'))} \big(a_i(o,t) - \hat{a}_i(o',t)\big)^2 \, dt. \qquad (4)$$

So, discretized signals â_i(o', t) for learning set objects are compared to full signals
for test objects. To deal with objects which are defined on different intervals, we
simply truncate the longer one to the duration of the shorter one.
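The space saving and the truncation rule can be illustrated with the following sketch for a single temporal attribute. It is a simplified illustration with hypothetical names (expand, distance, nearest_neighbor): a learning object is stored only as its segment boundaries and segment means, time is assumed to be uniformly sampled, and objects with several attributes would sum the per-attribute distances.

```python
def expand(breakpoints, means):
    """Rebuild the sampled piecewise constant signal from its segment
    boundaries (sample indices) and the mean value of each segment."""
    out = []
    for lo, hi, mean in zip(breakpoints, breakpoints[1:], means):
        out += [mean] * (hi - lo)
    return out


def distance(test_signal, breakpoints, means):
    """Mean squared difference between a full test signal and a stored
    piecewise constant model, after truncating both to the shorter duration."""
    approx = expand(breakpoints, means)
    n = min(len(test_signal), len(approx))
    return sum((a - b) ** 2 for a, b in zip(test_signal[:n], approx[:n])) / n


def nearest_neighbor(test_signal, learning_set):
    """learning_set is a list of (breakpoints, means, label) triples."""
    best = min(learning_set, key=lambda e: distance(test_signal, e[0], e[1]))
    return best[2]
```

A learning signal with k pieces is thus stored with k means and k + 1 boundaries instead of all its samples.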

Experiments have been carried out with increasing values of the number of 
time segments (from 1 to 11). Results are reported in Table 5. On the first three
problems, the accuracy is as good as the best accuracy which was obtained 
in Table 0 and the number of time segments to reach this accuracy is very 
small. The compression of the learning set is particularly impressive on the CC 
dataset where only three values are enough to reach an almost perfect accuracy. 
On the other hand, regression tree modeling decreases the accuracy on JV with 
respect to 1-NN and only two values per attribute. Again, the optimal number of 
pieces is problem dependent. In the context of 1-NN, leave-one-out is an obvious 
candidate method to determine this parameter. 



Table 5. Results of 1-NN with regression tree modeling

                                  Number of pieces
DB        1               3               5              7              11
CC        38.83 ± 9.40    0.50 ± 0.76     0.17 ± 0.50    0.33 ± 1.00    0.33 ± 0.67
CBF       46.17 ± 6.15    10.50 ± 2.69    1.33 ± 1.45    0.50 ± 0.76    0.33 ± 0.67
CBF-tr    43.00 ± 5.57    23.00 ± 7.18    4.50 ± 3.58    2.33 ± 1.70    2.50 ± 1.86
JV        11.35           5.67            4.86           4.59           4.59



6 Conclusion and Future Work 

In this paper, we have presented a new tool to handle time series in classification 
problems. This tool is based on a piecewise constant modeling of temporal signals 
by regression trees. Patterns are extracted from these models and combined in 
decision trees to give interpretable rules. This approach has been compared to 
two “naive” feature selection techniques. The advantage of our technique in 
terms of interpretability is undeniable. In terms of accuracy, better results can 
be obtained by using either boosting or 1-NN with naive features. However, in
some problems where the start times of characteristic events are highly variable
(as in CBF-tr), accuracy can be improved by pattern extraction. Finally, even
if our main goal was interpretability, our extended decision trees can also be
combined in boosted classifiers where they are very unlikely to destroy accuracy
with respect to the naive feature selection.

In the future, we will consider extensions of our method along several axes.
First, there are still many possible improvements of the pattern extraction al- 
gorithm. For instance, we can experiment with other piecewise models (linear 
by hinges model, polynomial,...) or with more robust sampling strategies during 
node splitting. As already mentioned, we also need a way to automatically adapt 
the number of time segments during tree growing. Second, one limitation of our 
pattern definition is that it does not allow the detection of shrunk or extended 
versions of the reference pattern along the time axis. Several distances have been 
proposed to circumvent this problem, for example dynamic time warping HOj 
or probabilistic pattern matching [0|. We believe that such distances could be 
combined with our approach but at the price of a higher complexity. Finally,
there exist many problems where the exact ordering of patterns appearing in 
signals is crucial for classification. In these cases, the combination of patterns 
by simple logical rules would not be enough and dedicated methods should be 
developed which could take into account temporal constraints between patterns. 
Abstract. We present a way of exploiting domain knowledge in the
design and implementation of data mining algorithms, with special
attention to frequent pattern discovery, within a deductive framework.
In our framework domain knowledge is represented by deductive rules,
and data mining algorithms are constructed by means of iterative
user-defined aggregates. Iterative user-defined aggregates have a fixed scheme
that allows the modularization of data mining algorithms, thus providing
a way to exploit domain knowledge at the right point. As a case study,
the paper presents user-defined aggregates for specifying a version of
the apriori algorithm. Some performance analyses and comparisons are
discussed in order to show the effectiveness of the approach.



1 Introduction and Motivations 

The problem of incorporating data mining technology into query systems has 
been widely studied in the current literature 11:^111141101 . In such a context, the 
idea of integrating data mining algorithms in a deductive environment m is 
very powerful, since it allows the direct exploitation of domain knowledge within 
the specification of the queries, the specification of ad-hoc interest measures 
that can help in evaluating the extracted knowledge, and the modelization of 
the interactive and iterative features of knowledge discovery in a uniform way. 
However, the main drawback of a deductive approach to data mining query 
languages concerns efficiency: a data mining algorithm can benefit from substantial
optimizations that come both from a smart constraining of the search space, and 
from the exploitation of efficient data structures. The case of association rules is 
a typical example of this. Association rules are computed from frequent itemsets, 
that actually can be efficiently computed by exploiting the apriori property HSI, 
and by speeding-up comparisons and counting operations with the adoption of 
special data structures (e.g., lookup tables, hash trees, etc.). Detailed studies P 
have shown that a direct specification of the algorithms within a query language
does not perform efficiently.

A partial solution to this problem has been proposed in mag. In these 
approaches, data mining algorithms are modeled as “black boxes” integrated 
within the system. The interaction between the data mining algorithm and the 
query system is provided by defining a representation formalism of discovered
patterns within the language, and by collecting the data to be mined in an ad-hoc
format (a cache), directly accessed by the algorithm. However, such a decoupled
approach has the main drawback of not allowing the tuning of the search on the
basis of specific properties of the problem at hand. As an example, using black
boxes we cannot directly exploit domain knowledge within the algorithm, nor
can we evaluate interest measures of the discovered patterns “on the fly”.

The above considerations yield an apparent mismatch: it is infeasible to
specify and implement data mining algorithms directly in the query language
itself, and conversely it is inconvenient to integrate data mining algorithms
within query languages as predefined modules. In this paper we propose
to combine the advantages of the two approaches in a uniform way. Following 
the approach of we adopt aggregates as an interface to mining tasks in a 
deductive database. Moreover, data mining algorithms are specified by means 
of iterative user-defined aggregates, i.e., aggregates that are computed using a 
fixed scheme. Such a feature makes it possible to modularize data mining
algorithms and to integrate domain knowledge at the right points, thus allowing
crucial domain-oriented optimizations.

On the other hand, user-defined predicates can be implemented by means of
hot-spot refinements PH. That is, we can extend the deductive databases with 
new data types, (like in the case of object-relational data systems), that can be 
efficiently accessed and managed using ad-hoc methods. Such data types and 
methods can be implemented by the user-defined predicates, possibly in other 
programming languages, with a reasonable trade-off between specification and 
efficient implementation. 

The main advantages of such an approach are twofold: 

— on the one side, we maintain an adequate declarative approach to the spec- 
ification of data mining algorithms. 

— on the other side, we can exploit specific (physical) optimizations improving 

the performance of the algorithms exactly where they are needed. 

As a case study, the paper presents how such a technique can be used to 
specify a version of the apriori algorithm, capable of taking into account domain 
knowledge in the pruning phase. We recall the patterns aggregate defined in 0, 
and provide a specification of such an aggregate as an iterative user-defined 
aggregate. Hence, we provide an effective implementation of the aggregate by 
exploiting user-defined predicates. 

The paper is organized as follows. Section 2 introduces the notion of iterative
user-defined aggregates, and justifies their use in the specification of data mining
algorithms. In Sect. 3 we introduce the patterns iterative aggregate for mining
frequent itemsets. In particular, we show how user-defined predicates can be
exploited to efficiently implement the aggregate. Finally, in Sect. 4 some
performance analyses and comparisons are discussed in order to show the
effectiveness of the approach.
2 Iterative User-Defined Aggregates

In cn we formalize the notion of logic-based knowledge discovery support envi- 
ronment, as a deductive database programming language that models inductive 
rules as well as deductive rules. Here, an inductive rule provides a smooth in- 
tegration of data mining and querying. In m we propose the modeling of an 
inductive rule by means of aggregate functions. The capability of specifying (and
efficiently computing) aggregates is very important in order to provide a basis
of a logic-based knowledge discovery support environment. To this purpose, the
Datalog++ logic-based database language imm provides a general framework
for dealing with user-defined aggregates. We use such aggregates as the
means to introduce mining primitives into the query language. 

In general, a user-defined aggregate Ea is defined as a distributive aggregate,
i.e., a function f inductively defined over a (nondeterministically sorted) set S:

$$f(\{x\}) = g(x) \qquad (1)$$

$$f(S \cup \{x\}) = h(f(S), x) \qquad (2)$$

We can directly specify the base and inductive cases, by means of ad-hoc
user-defined predicates single, multi and return, used implicitly in the evaluation
of the aggregate rule

    p(K1, ..., Km, aggr⟨X⟩) ← Rule body.

In particular, single(aggr, X, C) associates a value to the first tuple X in the
nondeterministic ordering, according to (1), and multi(aggr, Old, X, New)
computes the value of the aggregate aggr associated to the current value X in the
current ordering, by incrementally computing it from the previous value,
according to (2).

However, in order to define complex aggregation functions (such as mining
functions), the main problem with the traditional user-defined aggregate model is
the impossibility of defining more complex forms of aggregates than distributive
ones. In many cases, even simple aggregates may require multiple steps over
data in order to be computed. A simple way of coping with the problem
of multiple scans over data is to extend the specification of the
aggregation rule, in order to impose some user-defined conditions for iterating
the scan over data. The main scheme shown in H2I requires that the evaluation
of the query p(v1, ..., vm, v) is done by first compiling the above program, and
then evaluating the query on the compiled program. In the compiling phase, the
program is rewritten into an equivalent, fixed rule scheme, that depends upon
the user-defined predicates single, multi and return.

In [Kil l I j we slightly modify such rewriting, by making the scheme dependent
upon the new iterate user-defined predicate. Such a predicate specifies the
condition for iterating the aggregate computation: the activation (and evaluation)
of such a rule is subject to the successful evaluation of the user-defined predicate
iterate, so that any failure in evaluating it results in the termination of the
computation.




Example 1. The computation of the absolute deviation S_n = Σ_x |x − x̄| of a
set of n elements needs at least two scans over the data. Exploiting the iterate
predicate, we can define S_n as a user-defined aggregate:

    single(abserr, X, (nil, X, 1)).
    multi(abserr, (nil, S, C), X, (nil, S + X, C + 1)).
    multi(abserr, (M, D), X, (M, D + (M − X))) ← M > X.
    multi(abserr, (M, D), X, (M, D + (X − M))) ← M ≤ X.
    iterate(abserr, (nil, S, C), (S/C, 0)).
    freturn(abserr, (M, D), D).

The combined use of multi and iterate makes it possible to define two scans over
the data: the first scan computes the mean value, and the second one computes
the sum of the absolute differences with the mean value. ◁
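The control flow induced by single, multi, iterate and freturn can be mimicked in Python as follows. This is only a sketch of the intended evaluation order under the assumption that every scan visits all tuples; it is not the rewriting performed by the deductive system, and the driver run_iterative_aggregate and the abserr_* helper names are hypothetical.

```python
def run_iterative_aggregate(data, single, multi, iterate, freturn):
    """Evaluate an aggregate by repeated scans over `data`.

    `single` handles the first tuple of the first scan, `multi` every other
    tuple, and `iterate` is consulted after each scan: it returns the state
    for a new scan, or None when the computation must stop."""
    items = list(data)
    if not items:
        raise ValueError("the aggregate is not defined on an empty set")
    state = single(items[0])
    for x in items[1:]:
        state = multi(state, x)
    while True:
        next_state = iterate(state)
        if next_state is None:
            return freturn(state)
        state = next_state
        for x in items:
            state = multi(state, x)


# The abserr aggregate of Example 1; None plays the role of nil.
def abserr_single(x):
    return (None, x, 1)


def abserr_multi(state, x):
    if len(state) == 3:              # first scan: accumulate sum and count
        _, s, c = state
        return (None, s + x, c + 1)
    m, d = state                     # second scan: accumulate |x - mean|
    return (m, d + abs(m - x))


def abserr_iterate(state):
    if len(state) == 3:              # the first scan is over: start the second
        _, s, c = state
        return (s / c, 0)
    return None                      # no rule applies: terminate


def abserr_return(state):
    return state[1]
```

For the data 1, 2, 3, 6 the first scan yields the mean 3 and the second scan accumulates 2 + 1 + 0 + 3 = 6.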

Although the notion of iterative aggregate is in some sense orthogonal to 
the envisaged notion of inductive rule HH. the main motivation for introducing 
iterative aggregates is that the iterative schema shown above is common in many 
data mining algorithms. Usually, a typical data mining algorithm is an instance 
of an iterative schema where, at each iteration, some statistics are gathered 
from the data. The termination condition can be used to determine whether the 
extracted statistics are sufficient for the purpose of the task (i.e., they determine
all the patterns), or whether no further statistics can be extracted. 

Hence, the iterative schema discussed so far is a good candidate for specify- 
ing steps of data mining algorithms at low granularity levels. Relating aggregate 
specification with inductive rules makes it easy to provide an interface capa- 
ble of specifying source data, knowledge extraction, background knowledge and 
interestingness measures. Moreover, we can specify the data mining task un- 
der consideration in detail, by exploiting ad-hoc definitions of single, multi, 
iterate and return iterative user-defined predicates. 

3 The patterns Iterative Aggregate 

In the following, we concentrate on the problem of mining frequent patterns 
from a dataset of transactions. We can integrate such a mining task within the
Datalog++ database language, by means of a suitable inductive rule.

Definition 1 ([S|). Given a relation r, the patterns aggregate is defined by
the rule

    p(X1, ..., Xn, patterns⟨(min_supp, Y)⟩) ← r(Z1, ..., Zm)        (3)

where the variables X1, ..., Xn, Y are a rearranged subset of the variables
Z1, ..., Zm of r, min_supp is a value representing the minimum support threshold,
and the Y variable denotes a set of elements. The aggregate patterns computes
the set of predicates p(t1, ..., tn, (s, f)) where:

1. t1, ..., tn are distinct instances of the variables X1, ..., Xn, as resulting from
   the evaluation of r;

2. s = {l1, ..., lk} is a subset of the value of Y in a tuple resulting from the
   evaluation of r;

3. f is the support of the set s, such that f ≥ min_supp.

□

We can provide an explicit specification of the patterns aggregate in the 
above definition as an iterative aggregate. That is, we can directly implement an 
algorithm for computing frequent patterns, by defining the predicates single, 
multi, return and iterate. 

The simplest specification adopts the purely declarative approach of gener- 
ating all the possible itemsets, and then testing the frequency of the itemsets. 
It is easy to provide such a naive definition by means of the iterative schema 
proposed in Sect. 2:

    single(patterns, (Sp, S), ((Sp, 1), IS)) ← subset(IS, S).
    multi(patterns, ((Sp, N), _), (Sp, S), ((Sp, N + 1), IS)) ← subset(IS, S).
    multi(patterns, ((Sp, N), IS), _, (Sp, IS)).
    multi(patterns, (Sp, IS, N), (_, S), (Sp, IS, N + 1)) ← subset(IS, S).
    multi(patterns, (Sp, IS, N), (_, S), (Sp, IS, N)) ← ¬subset(IS, S).
    iterate(patterns, ((Sp, N), IS), (Sp × N, IS, 0)).
    freturn(patterns, (Sp, IS, N), (IS, N)) ← N ≥ Sp.

Such a specification works with two main iterations. In the first iteration
(first three rules), the set of possible subsets is generated for each tuple in the
dataset. The iterate predicate initializes the counter of each candidate itemset,
and activates the computation of its frequency (performed by the remaining
multi rules). The computation terminates when all itemset frequencies have
been computed, and frequent itemsets are returned as answers (by means of the
freturn rule). Notice that the freturn predicate defines the output format for
the aggregation predicate: a suitable answer is a pair (Itemset, N) such that
Itemset is an itemset of frequency N ≥ Sp, where Sp is the minimal support
required.

Clearly, the above implementation is extremely inefficient, since it checks the
support of all the possible itemsets. More precisely, the aggregate computation
generates 2^|I| sets of items, where I is the set of different items appearing in
the tuples considered during the computation. No pruning strategy is exploited:
infrequent subsets are discarded only at the end of the computation of the
frequencies of all the subsets. Moreover, no optimized data structure, capable
of speeding up the computation of some costly operations, is used.
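The cost of this naive strategy is easy to see in a direct procedural transcription. The sketch below is an illustration in Python, not the deductive specification itself; it assumes that the support threshold is given as a fraction of the number of transactions (mirroring Sp × N), and the name naive_frequent_itemsets is hypothetical.

```python
from itertools import combinations


def naive_frequent_itemsets(transactions, min_fraction):
    """Generate every subset of every transaction, then count them all."""
    transactions = [frozenset(t) for t in transactions]

    # First iteration: enumerate all candidate itemsets (exponential in the
    # size of each transaction).
    candidates = set()
    for t in transactions:
        items = sorted(t)
        for size in range(1, len(items) + 1):
            candidates.update(frozenset(c) for c in combinations(items, size))

    # Second iteration: count every candidate, frequent or not.
    threshold = min_fraction * len(transactions)
    counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}

    # Infrequent itemsets are discarded only at the very end.
    return {c: n for c, n in counts.items() if n >= threshold}
```

A single transaction with 20 items already produces more than a million candidates, which is why pruning during the computation matters.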

A detailed analysis of the Apriori algorithm 1 1 bj shown in Fig. 1, however,
suggests a smarter specification. Initially, the algorithm computes the candidate
itemsets of size 1 (init phase: step 1). The core of the algorithm is then a loop,
Algorithm Apriori(B, σ);

Input: a set of transactions B, a support threshold σ
Output: a set Result of frequent itemsets
Method: let initially Result = ∅, k = 1.

 1. C_1 = {a | a ∈ I};
 2. while C_k ≠ ∅ do
 3.     foreach itemset c ∈ C_k do
 4.         supp(c) = 0;
 5.     foreach b ∈ B do
 6.         foreach c ∈ C_k such that c ⊆ b do supp(c)++;
 7.     L_k := {c ∈ C_k | supp(c) ≥ σ};
 8.     Result := Result ∪ L_k;
 9.     C_{k+1} := {c_i ∪ c_j | c_i, c_j ∈ L_k ∧ |c_i ∪ c_j| = k + 1 ∧ ∀c ⊂ c_i ∪ c_j such that |c| = k : c ∈ L_k};
10.     k := k + 1;
11. end while

Fig. 1. Apriori Algorithm for computing frequent itemsets



where the k-th iteration examines the set C_k of candidate itemsets of size k.
During such an iteration the occurrences of each candidate itemset are computed
by scanning the data (count phase: steps 5-6). Infrequent itemsets are then dropped
(prune phase: step 7), and frequent ones are maintained in L_k. By exploiting the
subset-frequency dependence, candidate itemsets of size k + 1 can be built from
pairs of frequent itemsets of size k differing only in one position (enhance phase:
step 9). Finally, Result shall contain the union of all L_k (itemsets phase: step 8).
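The phases just described can be written compactly in Python. The sketch below follows the numbered steps of the algorithm; it assumes an absolute support count as threshold, uses plain sets instead of optimized data structures, and the function name apriori is chosen for illustration only.

```python
from itertools import combinations


def apriori(transactions, min_count):
    """Return a dict mapping each frequent itemset to its support count."""
    transactions = [frozenset(t) for t in transactions]
    # init phase (step 1): candidate itemsets of size 1
    candidates = {frozenset([item]) for t in transactions for item in t}
    result = {}
    k = 1
    while candidates:                                        # step 2
        support = {c: 0 for c in candidates}                 # steps 3-4
        for t in transactions:                               # count phase (5-6)
            for c in candidates:
                if c <= t:
                    support[c] += 1
        frequent = {c for c in candidates if support[c] >= min_count}  # prune (7)
        result.update({c: support[c] for c in frequent})     # itemsets phase (8)
        candidates = set()                                   # enhance phase (9)
        for a in frequent:
            for b in frequent:
                union = a | b
                if len(union) == k + 1 and all(
                    frozenset(s) in frequent for s in combinations(union, k)
                ):
                    candidates.add(union)
        k += 1                                               # step 10
    return result
```

On three transactions {beer, chips}, {beer, wine} and {beer, chips, wine} with a minimum count of 2, the pair {chips, wine} is pruned after the second scan, so the triple is never generated as a candidate.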

By exploiting iterative aggregates, we can directly specify all the phases of
the algorithm. Initially, we specify the init phase,

    single(patterns, (Sp, S), ((Sp, 1), IS)) ← single_isets(S, IS).
    multi(patterns, ((Sp, N), IS), (Sp, S), ((Sp, N + 1), ISS)) ← single_isets(S, SS),
                                                                   union(SS, IS, ISS).

The subsequent iterations resemble the steps of the apriori algorithm, that is
counting the candidate itemsets, pruning infrequent candidates and generating
new candidates:

    iterate(patterns, ((Sp, N), S), (Sp × N, S)).
    iterate(patterns, (Sp, S), (Sp, SS)) ← prune(Sp, S, IS),
                                           generate_candidates(IS, SS).
    multi(patterns, (Sp, IS), (_, S), (Sp, ISS)) ← count_isets(IS, S, ISS).
    freturn(patterns, (Sp, ISS), (IS, N)) ← member((IS, N), ISS), N ≥ Sp.

Such an approach exploits a substantial optimization, by avoiding checks on a
large portion of infrequent itemsets. However, the implementation of the main
operations of the algorithm is delegated to the predicates single_isets, prune,
generate_candidates and count_isets. As a consequence, the efficiency of the
approach depends on the efficient implementation and evaluation of such
predicates.
3.1 Exploiting User-Defined Predicates

In order to support complex database applications, most relational database
systems support user-defined functions. Such functions can be invoked in queries,
making it easier for developers to implement their applications with significantly
greater efficiency. The adoption of such features in a logic-based system has
even greater impact, since they allow a user to develop large programs by
hot-spot refinement 0. The user writes a large Datalog++ program, validates its
correctness and identifies the hot-spots, i.e., predicates in the program that are
highly time consuming. The user can then rewrite those hot-spots more efficiently
in a procedural language, such as C++, maintaining the rest of the program in
Datalog++.

The LDL++ jl Yll tij implementation of Datalog++ allows the definition of
external predicates written in C++, by providing mechanisms to convert objects
between the LDL++ representation and the external representations. The
ad-hoc use of such mechanisms proves very useful for providing new data types
inside the LDL++ model, in the style of object-relational databases. For example, a
reference to a C++ object can be returned as an answer, or passed as input, and
the management of such a user-defined object is delegated to a set of external
predicates.

We adopt such a model to implement hot-spot refinements of frequent itemset
mining. In the following we describe the implementation of an enhanced
version of the Apriori algorithm, described in Fig. 1, by means of user-defined
predicates. In practice, we extend the allowed types of the LDL++ system to include
more complex structures, and provide some built-in predicates that efficiently
manipulate such structures:



    single(patterns, (Sp, S), ((Sp, 1), T)) ← init(S, T).
    multi(patterns, ((Sp, N), T), (Sp, S), ((Sp, N + 1), T)) ← init(S, T).
    iterate(patterns, ((Sp, N), T), (Sp × N, T)) ← prune(Sp, T), enhance(T).
    multi(patterns, (Sp, T), (_, S), (Sp, T)) ← count(S, T).
    iterate(patterns, (Sp, T), (Sp, T)) ← prune(Sp, T), enhance(T).
    freturn(patterns, (Sp, T), (I, S)) ← itemset(T, (I, S)).

In such a schema, the variable T represents the reference to a structure of type
Hash-Tree ESI, which is essentially a prefix-tree with a hash table associated to
each node. An edge is labelled with an item, so that paths from the root to an
internal node represent itemsets. Figure 2 shows some example trees. Each node
is labelled with a tag denoting the support of the itemset represented by the
path from the root to the node. An additional tag denotes whether the node
can generate new candidates. The predicates init, count, enhance, prune and
itemset are user-defined predicates that implement, in C++, complex operators,
exemplified in Fig. 2, over the given hash-tree. More specifically:
Fig. 2. a) Tree initialization, b) pruning, c) tree enhancement and counting, d) pruning,
e) tree enhancement, f) cutting



— The init(l,T) predicate initializes and updates the frequencies of the 1- 
itemsets available from I in T (Fig.Et^). For each item found in the transac- 
tion I, either the item is already into the tree (in which case its counter is 
updated), or it is inserted and its counter set to 1. 

— The count(I,T) predicate updates the frequencies of each itemset in T ac- 
cording to the transaction I (Fig.l^t). We define a simple recursive procedure 
that, starting from the first element of the current transaction, traverses the 
tree from the root to the leaves. When a leaf at a given level is found, the 
counter is incremented. 

— The prune(M, T) predicate removes from T all the itemsets whose frequencies 
are less than M (Figs. and d). Leaf nodes at a given depth (representing 
the size of the candidates) are removed if their support is lower than the 
given threshold. 

- The enhance(T) predicate combines the frequent k-itemsets in T and generates 
the candidate (k+1)-itemsets. New candidates are generated in two steps. 
In the first step, a leaf node is merged with each of its siblings, and new sons 
are generated. For example, in Fig. 2e, the node labelled with beer-chips is 
merged with its sibling beer-wine, generating the new node labelled with 
beer-chips-wine. In order to ensure that every new node represents an 
actual candidate of size n + 1, we need to check whether all the subsets of 
the itemset of size n are actually in the hash tree. Such an operation consists 
in a traversal of the tree from the enhanced node to the root node; for each 
analyzed node, we simply check whether its subtree is also a subtree of its 
ancestor. Subtrees that do not satisfy such a requirement are cut (Fig. 2f). 

- Finally, the itemset(T, S) predicate extracts the frequent itemset I (whose 
frequency is S) from T. Since each leaf node represents an itemset, generation 
of itemsets is quite simple. The tree is traversed and itemsets are built 
accordingly. Notice that, unlike the previous predicates, where only 
one invocation was allowed, the itemset predicate allows multiple calls, 
providing one answer for each itemset found. 
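
The five operators above can be sketched in a few lines of Python. This is only 
an illustrative sketch, not the C++ code of the paper: it uses a plain 
dictionary-based prefix tree instead of a hash-tree, it adds an explicit depth 
argument that the predicates in the text do not have, and the driver apriori, 
the helper contains and the sample transactions in the usage note are 
hypothetical. 

```python
from itertools import combinations


class Node:
    """Tree node: `count` is the support tag, `children` maps an item to a subtree."""

    def __init__(self):
        self.count = 0
        self.children = {}


def init(transaction, root):
    """Insert or update the 1-itemsets found in one transaction."""
    for item in set(transaction):
        root.children.setdefault(item, Node()).count += 1


def count(transaction, root, depth):
    """Increment every node at level `depth` whose path lies in the transaction."""
    items = sorted(set(transaction))

    def walk(node, start, level):
        if level == depth:
            node.count += 1
            return
        for i in range(start, len(items)):
            child = node.children.get(items[i])
            if child is not None:
                walk(child, i + 1, level + 1)

    walk(root, 0, 0)


def prune(min_count, root, depth):
    """Remove the nodes at level `depth` whose counter is below `min_count`."""
    def walk(node, level):
        if level == depth - 1:
            node.children = {item: child for item, child in node.children.items()
                             if child.count >= min_count}
        else:
            for child in node.children.values():
                walk(child, level + 1)

    walk(root, 0)


def contains(root, itemset):
    """True if the (sorted) itemset is a path starting at the root."""
    node = root
    for item in itemset:
        node = node.children.get(item)
        if node is None:
            return False
    return True


def enhance(root, depth):
    """Merge sibling nodes at level `depth`; a candidate is kept only if all of
    its subsets of size `depth` are in the tree. Returns the number added."""
    added = 0

    def walk(node, path, level):
        nonlocal added
        if level == depth - 1:
            siblings = sorted(node.children)
            for i, a in enumerate(siblings):
                for b in siblings[i + 1:]:
                    candidate = path + (a, b)
                    if all(contains(root, s) for s in combinations(candidate, depth)):
                        node.children[a].children[b] = Node()
                        added += 1
        else:
            for item, child in node.children.items():
                walk(child, path + (item,), level + 1)

    walk(root, (), 0)
    return added


def itemsets(node, path=()):
    """Yield (itemset, count) for every node left in the tree, one at a time."""
    for item in sorted(node.children):
        child = node.children[item]
        yield path + (item,), child.count
        yield from itemsets(child, path + (item,))


def apriori(transactions, min_count):
    """Hypothetical driver: init, then alternate prune / enhance / count."""
    root = Node()
    for t in transactions:
        init(t, root)
    depth = 1
    prune(min_count, root, depth)
    while enhance(root, depth) > 0:
        depth += 1
        for t in transactions:
            count(t, root, depth)
        prune(min_count, root, depth)
    return dict(itemsets(root))
```

Calling apriori on a list of item sets with an absolute minimum count returns a 
dictionary from sorted item tuples to counts; itemsets is a generator, which 
mirrors the "one answer per call" behaviour described for the itemset predicate. 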

The above schema provides a declarative specification that is parametric to 
the intended meaning of the user-defined predicates adopted. The schema 
minimizes the “black-box” structure of the algorithm, needed to obtain fast 
counting of candidate itemsets and efficient pruning, and provides many 
opportunities for optimizing the execution of the algorithm, both from a 
database optimization perspective and from a “constraints” embedding 
perspective [?]. 

4 Performance Analysis 

In this section we analyze the impact of the architecture described in the 
previous sections on the process of extracting association rules from data. The 
performance analysis that we undertook compared the effect of mining 
association rules according to four different architectural choices: 

1. DB2 Batch, an Apriori implementation that retrieves data from a SQL 
DBMS, stores such data in an intermediate structure and then performs the 
basic steps of the algorithm using such structures. Such an implementation 
conforms to the Cache-Mine approach. The main motivation is to compare 
the effects of such an implementation with a similar one in the LDL++ 
deductive database. Conceptually, such an implementation can be thought of 
as the architectural support for an SQL extension, like, e.g., the MINE RULE 
construct shown in [?]. 

2. DB2 interactive, an Apriori implementation in which data is read tuple by 
tuple from the DBMS. This approach is very easy to implement and manage, 
but has the main disadvantage of the large cost of context switching between 
the DBMS and the mining process. Since user-defined predicates also need 
such context switching, it is interesting to see how the approach behaves 
compared to the LDL++ approach. 

3. LDL++, the implementation of the rule-mining aggregate patterns, by 
means of the iterative aggregate specified in the previous section. 

4. A plain Apriori implementation (Apriori in the following), that reads data 
from a binary file. We used such an implementation to keep track of the 
actual computational effort of the algorithm on the given data size when no 
data retrieval and context switching overhead is present. 

We tested the effect of a very simple form of mining query (one that is 
also expressible in SQL) that retrieves data from a single table and applies 
the mining algorithm. In LDL++ terms, the experiments were performed by 
querying ans(min_supp, R), where ans was defined as 

ans(S, patterns((S, ItemSet))) <- transaction(ID, ItemSet). 

and the transaction(ID, ItemSet) relation is a materialized table. 

Fig. 3. Performance comparison and summary 

In order to populate the transaction predicate (and its relational 
counterpart), we used the synthetic data generation utility described in 
[?, Sect. 2.4.3]. Data generation can be tuned according to the usual 
parameters: the number of transactions ($|D|$), the average size of the 
transactions ($|T|$), the average size of the maximal potentially frequent 
itemsets ($|I|$), the number of maximal potentially frequent itemsets ($|L|$), 
and the number of items ($N$). We fixed $|I|$ to 2 and $|T|$ to 10, since such 
parameters affect the size of the frequent itemsets. All the remaining 
parameters were adjusted according to increasing values of $|D|$: as soon as 
$|D|$ increases, $|L|$ and $N$ are increased as well. 

The following figures show how the performance of the various solutions 
changes according to increasing values of $|D|$ and decreasing values of the 
support. Experiments were made on a Linux system with two 400 MHz Intel 
Pentium II processors and 128 MB of RAM. Alternatives 1 and 2 were implemented 
using the IBM DB2 universal database v6.1. 

In Fig. 3 it can be seen that, as expected, the DB2 (interactive) solution 
gives the worst results: since a cursor is maintained against the internal buffer of 
the database server, the main contribution to the cost is given by the frequent 
context switching between the application and the database server. Moreover, 
decreasing values of support strongly influence its performance: lower support 
values increase the size of the frequent patterns, and hence multiple scans over 
the data are required. 

Figure 3 shows that the LDL++ approach outperforms the DB2 (Batch) 
approach. However, as soon as the size of the dataset is increased, the difference 
between the two approaches tends to decrease: the graphs show that the LDL++ 
performance gradually worsens, and we can expect that, for larger datasets, 
DB2 (Batch) can outperform LDL++. Such a behavior finds its explanation in 
the processing overhead of the deductive system with respect to the relational 
system, which can be quantified, as expected, by a constant factor. 

Fig. 4. Context switching overhead 



The second graph in Fig. 3 summarizes the performance of the LDL++ system 
for different values of the data size. The performance graph has a smooth 
(almost linear) curve. The ratio between the data preprocessing of LDL++ and 
the application of the Apriori algorithm (i.e., the context switching overhead) is 
shown in Fig. 4. The ratio is 1 when the internal management phase is 
predominant with respect to the application of the algorithm. As we can see, this 
ratio is particularly high with the last dataset, which does not contain frequent 
itemsets (except for very low support values), and hence the predominant 
computational cost is due to context switching. 



5 Conclusions and Future Work 

Iterative aggregates have the advantage of allowing the specification of data 
mining tasks at the desired abstraction level: from a conceptual point of view, 
they allow a direct use of background knowledge in the algorithm specification; 
from a physical point of view, they give the opportunity of directly integrating 
proper knowledge extraction optimizations. In this paper we have shown how 
the basic framework allows physical optimizations: an in-depth study of how to 
provide high-level optimization by means of direct exploitation of background 
knowledge has to be performed. 

The problem of tailoring optimization techniques to mining queries is a major 
research topic in a database-oriented approach to data mining. It is not 
surprising that such a topic is even more substantial in deductive-based 
approaches, like the one presented in this paper. In [?] we have shown some 
examples of how a logic-based language can benefit from a thorough modification 
of the underlying abstract machine, and how other interesting ways of coping 
with efficiency can be investigated (for example, by extracting expressive 
subsets of the language viable for efficient implementation). 

We are currently interested in formalizing such modifications in order to 
provide a mapping of deductive mining query specifications to query plan 
generations and optimizations. 

Abstract. In this paper we examine association rules and their 
interestingness. Usually these rules are discussed in the world of basket analysis. 
Instead of customer data we now study the situation with data records 
of a more general but fixed nature, incorporating quantitative 
(non-boolean) data. We propose a method for finding interesting rules with 
the help of fuzzy techniques and taxonomies for the items/attributes. 
Experiments show that the use of the proposed interestingness measure 
substantially decreases the number of rules. 



1 Introduction 

In this paper we study association rules, i.e., rules such as “if a person buys 
products a and b, then he or she also buys product c” . Such a rule has a certain 
support (the number of records satisfying the rule, e.g., the number of people 
buying a, b and c) and confidence (the fraction of records containing the items 
from the “then part” out of those containing the items from the “if part”). In 
most practical situations an enormous number of these rules, usually consisting 
of two or three items, is present. One of the major problems is to decide which 
of these rules are interesting. 
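
The two measures can be made concrete with a short sketch. The Python below is 
only an illustration under simple assumptions: records are plain sets of items, 
and the function names and the sample baskets are ours, not part of the paper. 

```python
def support(records, itemset):
    """Number of records that contain every item of `itemset`."""
    return sum(1 for record in records if itemset <= record)


def confidence(records, if_part, then_part):
    """Fraction of the records containing `if_part` that also contain `then_part`."""
    return support(records, if_part | then_part) / support(records, if_part)


# Hypothetical baskets, chosen only for illustration.
baskets = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b", "c"}, {"a", "b", "c"}]
rule_support = support(baskets, {"a", "b", "c"})
rule_confidence = confidence(baskets, {"a", "b"}, {"c"})
```

For these five baskets the rule “if a and b, then c” has support 2 and 
confidence 2/3. Note that confidence raises a ZeroDivisionError when no record 
contains the “if part”. 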

Association rules are of particular interest in the case of basket analysis, but 
also when more general so-called quantitative or categorical data are considered, 
cf. [?]. Here one can think of augmented basket data, where information on 
the customer or time stamps are added, but also on more general fixed format 
databases. For example, one can examine a car database with information on 
price, maximum speed, horsepower, number of doors and so on. But also quite 
different databases can be used, for instance web-log files. So instead of products 
we shall rather speak of items or attributes, and buying product a should be 
rephrased as having property a. We get rules like “if a car has four doors and is 
made in Europe, then it is expensive” . 

If we only consider the support of a rule, there is no emphasis on either “if 
part” or “then part”, and in fact we rather examine the underlying itemset, in 
our first example {a, b, c}. A k-itemset consists of k elements. Such a set is called 
frequent if its support is larger than some threshold, which is given in advance. 
In this paper we focus on the support rather than the confidence. 

In the sequel we shall define a precise notion of interestingness, based on 
hierarchies with respect to the items. Using both simple real life data and more 
complicated real life data we illustrate the relevance of this notion. Our goal is to 
find a moderate number of association rules describing the system at hand, where 
uninteresting rules that can be derived from others are discarded. Interestingness 
of itemsets based on a hierarchy for the items is also discussed in [?], where 
for a one taxonomy situation a different notion of lifting to parents is used. 
Several other measures of interestingness for the non-fuzzy case not involving 
taxonomies are mentioned in [?] and references in these papers; for a 
nice overview see [?]. 

We would like to thank Jan Niestadt, Daniel Palomo van Es and the referees 
for their helpful comments. 

2 Fuzzy Approach 

If one considers more general items/attributes, one has to deal with non-boolean 
values. Several approaches have been examined, each having its own merits and 
peculiarities. Two obvious methods are the usual boolean discretization (see 
[?]; note that this method suffers from the sharp boundary problem) and the 
fuzzy method. In this paper we focus on the fuzzy approach: split a non-boolean 
attribute into a (small) number of possible ranges called fuzzified attributes, and 
provide appropriate membership values (see [?]). 

Some attributes naturally split into discrete values, for instance number of 
doors, giving a small number of crisp values. One can choose to add as many 
new items/attributes as there are values. It is also possible, in particular for 
two-valued attributes, to keep the boolean 0/1 notation. One has to keep in 
mind however that this gives rise to an asymmetry in the following sense: since 
only non-zero values will contribute, the rules found do not deal with “negative” 
information. For instance, if an attribute Doors has two possible values, 2 and 4, 
one can either split it into two new attributes Doors2 and Doors4 (notice that 
always exactly one of these will be true, so there is a clear negative dependency), 
or keep only one attribute having the value 1 in the case of four doors; in the 
latter case rules with “having two doors” cannot easily be found. 

Fig. 1. Membership values for attribute Horsepower, split into four regions 

An example for a more complex attribute is given in Fig. 1, where the 
attribute Horsepower is fuzzified into four regions. We can now say that a record 
has property a to a certain extent, e.g., a 68 Horsepower car has Hp1 value 0.2 
and Hp2 value 0.8. In many situations the regions are chosen in such a way that 
for all values in the domain at most two membership values are non-zero. This 
approach is especially attractive, since it leads to hierarchies in a quite natural 
way: starting from basic ranges one can combine them into larger and larger 
ones in several ways, e.g., Hp12 might be the union of Hp1 and Hp2. Usually 
the membership values for the fuzzified attributes belonging to the same original 
attribute of a given record add to 1, as for the crisp case mentioned above. Note 
that the choice of the number of regions and the shape of the membership 
functions may be a difficult one. In this paper we use linear increase and decrease 
functions for the boundaries of the regions. The fuzzifications are obtained 
manually; it is also possible to apply clustering algorithms to determine clusters, 
and then use these as a basis for the regions. 
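
A fuzzification with linear boundaries can be sketched as follows. The 
breakpoints below are hypothetical: they were chosen by us only so that a 68 
Horsepower car gets Hp1 value 0.2 and Hp2 value 0.8, as in the example above, 
and they are not the values used in the paper. The function names are 
assumptions as well. 

```python
INF = float("inf")


def trapezoid(x, a, b, c, d):
    """Membership that rises linearly on [a, b], is 1 on [b, c], falls on [c, d]."""
    if x <= a or x >= d:
        return 0.0
    if x < b:
        return (x - a) / (b - a)
    if x <= c:
        return 1.0
    return (d - x) / (d - c)


# Hypothetical regions for Horsepower; neighbouring regions share their slopes,
# so the membership values of one record always add to 1.
REGIONS = {
    "Hp1": (-INF, -INF, 60, 70),
    "Hp2": (60, 70, 100, 120),
    "Hp3": (100, 120, 150, 180),
    "Hp4": (150, 180, INF, INF),
}


def fuzzify(horsepower):
    """Membership value of one Horsepower reading in each region."""
    return {name: trapezoid(horsepower, *shape) for name, shape in REGIONS.items()}


memberships = fuzzify(68)
```

With these breakpoints at most two membership values are non-zero for any 
input, which is the situation described in the text. 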

In a fuzzy context the support of an itemset should be understood in the 
following way: for every record in the database take the product of the membership 
values (that can be crisp) of the attributes under consideration, and sum these 
numbers. If we have $n$ records, and if $\mu(i,j)$ denotes the membership value of 
the $j$-th (fuzzified) attribute of the $i$-th record, then the support of an itemset 
$A = \{a_{j_1}, a_{j_2}, \ldots, a_{j_k}\}$ is defined by 

\[
\mathrm{support}(A) = \sum_{i=1}^{n} \prod_{\ell=1}^{k} \mu(i, j_\ell).
\]

Notice that the usual 0/1 version is a special case. Here we mimic the 
well-known fuzzy And: And(x, y) = x · y. Besides taking the product, there are also 
other possibilities for the fuzzy And, for instance the often used minimum. The 
product however has a beneficial property, which is easily demonstrated with an 
example. Suppose that a car has Hp1 value 0.2 and Hp2 value 0.8 (the other Hp 
values being 0), and Price1 value 0.4 and Price2 value 0.6. Then it contributes 
to the combination {Hp1, Price1} a value of 0.2 · 0.4 = 0.08, and similarly to 
the other three cross combinations values of 0.12, 0.32 and 0.48, respectively, 
the four of them adding to 1. The minimum would give 0.2, 0.2, 0.4 and 0.6, 
respectively, yielding a total contribution of 1.4 > 1. In similar crisp situations 
every record of the database has a contribution of 1, and therefore we prefer the 
product. 
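
The comparison between product and minimum can be replayed in a few lines. 
This is a minimal sketch, assuming that a record is a dictionary from fuzzified 
attribute names to membership values; the function names are ours. 

```python
from itertools import product
from math import prod


def contribution(record, itemset, fuzzy_and=prod):
    """Contribution of one record: fuzzy And over its membership values."""
    return fuzzy_and([record.get(attribute, 0.0) for attribute in itemset])


def fuzzy_support(records, itemset, fuzzy_and=prod):
    """Support of an itemset: sum of the contributions of all records."""
    return sum(contribution(record, itemset, fuzzy_and) for record in records)


# The car from the example in the text.
car = {"Hp1": 0.2, "Hp2": 0.8, "Price1": 0.4, "Price2": 0.6}
pairs = list(product(["Hp1", "Hp2"], ["Price1", "Price2"]))
by_product = [contribution(car, pair) for pair in pairs]
by_minimum = [contribution(car, pair, fuzzy_and=min) for pair in pairs]
```

Summing by_product gives (up to rounding) 1, whereas summing by_minimum gives 
1.4, which is the reason stated above for preferring the product. 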

Some simple example itemsets are {Milk, Bread}, {Milk, TimeEarly} and 
{Expensive, Europe, Doors4}. Notice that the first one refers to the number of 
people buying both milk and bread, the second one measures the number of 
people buying milk early in the day (where TimeEarly is a fuzzified attribute), 
and the third one deals with the occurrence of expensive European cars having 
four doors in the current database. 

In some situations it may occur that itemsets consisting of different “regions” 
of one and the same attribute have a somewhat high support, for instance the 
itemset {Hp1, Hp2}. This phenomenon indicates that many records lie in the 
intersection of these regions, and that the attribute needs to be fuzzified in yet 
another way. 

3 Taxonomies 

Now we suppose that a user-defined taxonomy for the items is given, i.e., a 
categorization of the items/attributes is available. In this setting association rules 
may involve categories of attributes; abstraction from brands gives generalized 
rules, that are often more informative, intuitive and flexible. As mentioned 
before, non-boolean attributes also lead to natural hierarchies. Since the number 
of generated rules increases enormously, a notion of interestingness, cf. [?], is 
necessary to describe them. It might for instance be informative to know that 
people often buy milk early in the day; on a more detailed level one might detect 
that people who buy low fat milk often do so between 11 and 12 o’clock. The 
more detailed rule is only of interest if it deviates substantially from what is 
expected from the more general one. It might also be possible to get more grip 
on the possible splittings of quantitative attributes, cf. [?]. 

A taxonomy is a hierarchy in the form of a tree, where the original items 
are the leaves, and the root is the “item” All; see [?] for the non-quantitative 
situation. The (internal) nodes of the taxonomies are sets of original items, these 
being singleton sets; every parent is the union of its children. In the case of 
fuzzy attributes, the fuzzy value of a parent is the sum of those from its children 
(assuming that this sum is at most 1), which corresponds to the fuzzy Or: 
Or(x, y) = min(1, x + y). For example, the Hp12 value for a 68 Horsepower car 
is 0.2 + 0.8 = 1.0. One can also consider the case where several taxonomies 
are given. In this setting, an itemset is allowed to be any set of nodes from 
arbitrary levels from the taxonomies. Often we will restrict an itemset to belong 
to a single taxonomy. The root All is the set of all original items, and is the root 
of all taxonomies at hand. 
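
The value of an aggregate node can be computed recursively from the leaves. 
The sketch below assumes a taxonomy stored as a dictionary from each parent to 
the list of its children; this representation, the function name and the small 
taxonomy fragment are hypothetical. 

```python
def node_value(record, taxonomy, node):
    """Fuzzy value of a taxonomy node for one record.

    A leaf reads its membership value from the record; an internal node takes
    the fuzzy Or of its children, i.e. min(1, sum of the children's values).
    """
    children = taxonomy.get(node)
    if children is None:
        return record.get(node, 0.0)
    return min(1.0, sum(node_value(record, taxonomy, child) for child in children))


# Hypothetical fragment of a taxonomy for the car database: parent -> children.
taxonomy = {
    "Hp12": ["Hp1", "Hp2"],
    "Hp34": ["Hp3", "Hp4"],
    "All": ["Hp12", "Hp34"],
}
car_68hp = {"Hp1": 0.2, "Hp2": 0.8}
hp12 = node_value(car_68hp, taxonomy, "Hp12")
```

For the 68 Horsepower car this gives an Hp12 value of 1.0 and an Hp34 value of 
0, and the root All also evaluates to 1. 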

A simple example of a taxonomy for a car database, with attributes Price1, 
Price2, Doors2, Doors4, Hp1, Hp2, Hp3 and Hp4, and aggregates Price12, Hp12 
and Hp34, is presented in Fig. 2. 

4 Interestingness 

An itemset (or rule) should be called interesting if it is in a way “special” with 
respect to what it is expected to be in the light of its parents. We first give 
some definitions concerning the connection between parent itemsets and their 
children. 

4.1 Definitions 

Fig. 2. Example - A simple taxonomy for a car database 

A first generation ancestor itemset of a given itemset is created by replacing one 
or more of its elements by their immediate parents in the taxonomy. For the 
moment we choose to stay within one taxonomy, but it is also possible to use 
several taxonomies simultaneously. The only difference in that case is that 
elements can have more than one parent. The support of an ancestor itemset gives 
rise to a prediction of the support of the $k$-itemset $X = \{a_1, a_2, \ldots, a_k\}$ itself: 
suppose that the nodes $a_1, a_2, \ldots, a_\ell$ ($1 \le \ell \le k$) are replaced by (lifted to) their 
ancestors $\hat{a}_1, \hat{a}_2, \ldots, \hat{a}_\ell$ (in general not necessarily their parents: an ancestor of 
$a$ is a node on the path from the root All to $a$, somewhere higher in the 
taxonomy), giving an itemset $\hat{X}$. Then the support of $X$ is estimated by the support of 
$\hat{X}$ times the confidence of the rule “$\hat{a}_1, \hat{a}_2, \ldots, \hat{a}_\ell$ implies $a_1, a_2, \ldots, a_\ell$”: 

\[
\mathrm{EstimatedSupport}_{\hat{X}}(\{a_1, a_2, \ldots, a_\ell, a_{\ell+1}, \ldots, a_k\}) =
\mathrm{Support}(\{\hat{a}_1, \hat{a}_2, \ldots, \hat{a}_\ell, a_{\ell+1}, \ldots, a_k\}) \times
\frac{\mathrm{Support}(\{a_1, a_2, \ldots, a_\ell\})}{\mathrm{Support}(\{\hat{a}_1, \hat{a}_2, \ldots, \hat{a}_\ell\})}.
\]

This estimate is based on the assumption that given the occurrence of the lifted 
items $\{\hat{a}_1, \hat{a}_2, \ldots, \hat{a}_\ell\}$, the occurrences of $\{a_1, a_2, \ldots, a_\ell\}$ and $\{a_{\ell+1}, a_{\ell+2}, \ldots, a_k\}$ 
are independent events, see [5]. In fact, this is a simple application of conditional 
probabilities: if 

\[
P(X \mid \hat{a}_1, \hat{a}_2, \ldots, \hat{a}_\ell) =
P(a_1, a_2, \ldots, a_\ell \mid \hat{a}_1, \hat{a}_2, \ldots, \hat{a}_\ell) \times
P(a_{\ell+1}, \ldots, a_k \mid \hat{a}_1, \hat{a}_2, \ldots, \hat{a}_\ell),
\]

then 

\[
P(X) = P(\hat{a}_1, \hat{a}_2, \ldots, \hat{a}_\ell) \times P(X \mid \hat{a}_1, \hat{a}_2, \ldots, \hat{a}_\ell)
     = P(\hat{X}) \times P(a_1, a_2, \ldots, a_\ell \mid \hat{a}_1, \hat{a}_2, \ldots, \hat{a}_\ell),
\]

where 

\[
P(a_1, a_2, \ldots, a_\ell \mid \hat{a}_1, \hat{a}_2, \ldots, \hat{a}_\ell) =
\frac{\mathrm{Support}(\{a_1, a_2, \ldots, a_\ell\})}{\mathrm{Support}(\{\hat{a}_1, \hat{a}_2, \ldots, \hat{a}_\ell\})}.
\]
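
The estimate is a one-line computation once the three supports are known. The 
sketch below assumes that supports are available in a dictionary keyed by 
frozensets and that the lifting is given as a dictionary from each lifted item 
to its ancestor; these representations, the function name and the numbers are 
hypothetical. 

```python
def estimated_support(support, itemset, lifting):
    """Predict the support of `itemset` from one of its ancestor itemsets.

    support: dict mapping frozenset -> support value
    lifting: dict mapping each lifted item to the ancestor that replaces it
    """
    lifted_items = frozenset(lifting)
    ancestors = frozenset(lifting.values())
    ancestor_itemset = (frozenset(itemset) - lifted_items) | ancestors
    # Support of the ancestor itemset times the confidence of
    # "ancestors implies lifted items".
    return support[ancestor_itemset] * support[lifted_items] / support[ancestors]


# Hypothetical supports, for illustration only.
supports = {
    frozenset({"Dairy", "Bread"}): 100,
    frozenset({"Milk"}): 60,
    frozenset({"Dairy"}): 150,
}
estimate = estimated_support(supports, {"Milk", "Bread"}, {"Milk": "Dairy"})
```

Here {Milk, Bread} is predicted from {Dairy, Bread}: 100 records contain the 
ancestor itemset and 60 out of 150 Dairy records are Milk records, so the 
estimate is 100 · 60 / 150 = 40. 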



Now an itemset is called interesting if and only if the predicted (fuzzy) 
supports based on all (but one as we shall see soon (*)) of its first generation 
ancestor itemsets deviate substantially from its real (fuzzy) support. If there is 
at least one parent that predicts the child suitably well, this itemset is not in- 
teresting enough. The word “substantially” means that the predicted supports 
are all larger than the real support, or are all smaller than the real support, 
by at least some fixed factor. This factor is called the interestingness threshold. 
If all items from an itemset are lifted, estimated support and real support are 
exactly the same, so it makes sense to omit this prediction (see (*)). Therefore 
1-itemsets are always interesting, in particular the itemset {All} (which does 
not have ancestors): there is no way to predict their support. In order to give a 
complete description of the “rule database” it is sufficient to describe the inter- 
esting rules: the behaviour of the others can then be derived - if one remembers 
which ancestor itemset provided the best prediction. 



4.2 More Details 

The reasons that only first generation ancestor itemsets are used instead of 
arbitrary ancestors as in [?] (where the number of items in the two itemsets 
should also be the same) are the following. First, it severely restricts the number 
of sets that need to be examined. (Note that in a single taxonomy a $k$-itemset 
has $2^k - 2$ first generation ancestor itemsets in principle, the number of arbitrary 
ancestors being much higher; $k$ is small in practice.) And second, if a set cannot 
be understood through any of its parents, but some grandparent does predict its 
support, in our opinion it still deserves attention. 
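
The count of first generation ancestor itemsets can be checked by enumeration. 
The sketch below lifts every non-empty proper subset of the items to the 
immediate parents; the parent map and the item names are hypothetical, and the 
three items were deliberately given distinct parents so that no two liftings 
coincide. 

```python
from itertools import combinations


def first_generation_ancestors(itemset, parent):
    """Yield the itemsets obtained by lifting a non-empty proper subset of the
    items to their immediate parents (single taxonomy, one parent per item)."""
    items = sorted(itemset)
    for size in range(1, len(items)):
        for lifted in combinations(items, size):
            yield frozenset(parent[a] if a in lifted else a for a in items)


# Hypothetical parents, for illustration only.
parent = {"Milk": "Dairy", "Bread": "Bakery", "Chili": "Spices"}
ancestors = list(first_generation_ancestors({"Milk", "Bread", "Chili"}, parent))
```

For this 3-itemset the enumeration produces 2^3 - 2 = 6 itemsets. When two 
items share a parent, several liftings can collapse into the same set, which 
is one reading of the qualification “in principle” in the text. 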

Some problems arise during the lifting. In [?] the problem of several 
hierarchies (one item may contribute to several lifted ones at the same time, e.g., Milk 
is both Dairy and Fluid) is discussed. Another problem mentioned there is this: 
when lifting sibling attributes one gets itemsets of the form {Child, Parent}, 
e.g., {Milk, Dairy}. In [?] this was interpreted as the set {Milk}, since, logically 
speaking, buying milk is enough to satisfy both the milk and dairy 
requirements: 

(Milk And Dairy) = (Milk And (Milk Or ...)) = Milk. 

In the fuzzy approach the lifting of siblings from one and the same original 
attribute (which only happens in rare situations) is treated in an analogous 
manner, using fuzzy And and Or. For example, suppose that a car has Hp1 
value 0.2 and Hp2 value 0.6 (this differs from the situation in Fig. 1 and in 
our experiments, where at most two membership values corresponding to one 
original attribute are non-zero), then its parent Hp12 has value 0.2 + 0.6 = 0.8, 
and its contribution to the itemset {Hp2, Hp12} equals 0.6 · (0.2 + 0.6) = 0.48. 
Note that this is analogous to the situation for crisp boolean attributes: for 
boolean x, y we have x ∧ (y ∨ x) = x, leading to the interpretation mentioned in 
the beginning of this paragraph. 

With respect to the partitioning of the attributes, either fuzzy or discrete, the 
notion of interestingness has yet another beneficial property. The support of an 
itemset may depend severely on the chosen partitioning. For instance, if a time 
period is split into periods of different sizes, the smaller ones will naturally have 
lower support. In the definition of EstimatedSupport the support of an itemset 
is estimated by the support of its parent multiplied by a factor that accounts 
for the relative size. The chosen partitioning is therefore of less importance than 
one would think at first sight, if interestingness is considered. But still it is 
necessary to carefully make this choice, since the domains should be split into 
discriminating understandable parts. 

During experiments it sometimes occurred that itemsets containing some 
high supported item appeared to be interesting with respect to their first 
generation ancestor itemsets. From another point of view however, they might be 
considered not that interesting. For example, if an itemset {a} has very high 
support, the itemsets {a, b, c} and {b, c} will probably have (nearly) the same 
support, and hence we feel that {a, b, c} is not interesting. In general this 
phenomenon can be easily detected by checking whether the support of {a, b, c} can 
be predicted through that of {b, c}. This corresponds to the situation where in 
the formula for the estimated support one particular item is lifted to the artificial 
item All. Because this easily computed extra interestingness measure improves 
the quality of the rules found, we added it in our experiments. This measure 
for interestingness is analogous to that in [2], where one or more items can be 
deleted from itemsets in order to check whether or not their support can be 
predicted from that of the smaller subsets. Finally, note that if in general one 
lifts to an attribute that is always 1, for instance the common ancestor of the 
different regions of fuzzified attributes, this corresponds to lifting to All. 
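
Lifting one item to All can be written out explicitly. The lines below are a 
worked instance of the estimated-support formula for the itemset {a, b, c}, 
under the assumption (ours) that All has membership value 1 in each of the n 
records, so that Support({All}) = n and Support({All, b, c}) = Support({b, c}). 

```latex
\begin{align*}
\mathrm{EstimatedSupport}(\{a, b, c\})
  &= \mathrm{Support}(\{\mathit{All}, b, c\}) \times
     \frac{\mathrm{Support}(\{a\})}{\mathrm{Support}(\{\mathit{All}\})} \\
  &= \mathrm{Support}(\{b, c\}) \times \frac{\mathrm{Support}(\{a\})}{n}.
\end{align*}
```

When Support({a}) is close to n the last factor is close to 1, so the estimate 
is close to Support({b, c}); this is the case in which {a, b, c} is predicted 
well by {b, c} and is therefore not regarded as interesting. 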

5 Algorithms 

The algorithms that find all interesting rules are straightforward. The well-known 
Apriori algorithm from [?], or any of its refinements, provides a list of all 
association rules, or rather the underlying itemsets. This algorithm can be easily 
adapted to generate all rules including nodes from the taxonomy (for more 
details, see [?]), where special care has to be taken to avoid parent-child problems, 
and to the fuzzy situation (see [?]). Note that Apriori works under the 
assumption that the support of a subset is always at least the support of any superset, 
which also holds in this generalized setting (all fuzzy membership values are at 
most 1 and we use multiplication as fuzzy And; by the way, the frequently used 
minimum can also be chosen). In fact, if one augments the list of original items 
with all non-leaves from the taxonomy, the computations are straightforward. 
Once the list of all rules and their supports is known, it is easy to generate the 
interesting ones by just comparing supports for the appropriate rules. The order 
in which the computations are performed is of no importance. 
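
The monotonicity used by Apriori can be checked in one line. Below, μ(i, j) 
stands for the membership value of attribute j in record i (this symbol is our 
notation), n is the number of records, B is an itemset and A is a subset of B. 

```latex
\[
\mathrm{support}(B)
  = \sum_{i=1}^{n} \prod_{j \in A} \mu(i, j) \prod_{j \in B \setminus A} \mu(i, j)
  \le \sum_{i=1}^{n} \prod_{j \in A} \mu(i, j)
  = \mathrm{support}(A),
\qquad \text{since } 0 \le \mu(i, j) \le 1 .
\]
```

The same conclusion holds when the minimum is used as fuzzy And, because the 
minimum over more values can only be smaller. 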

For every frequent itemset $X$ all its first generation ancestor itemsets $\hat{X}$ are 
generated, and expected and real support are compared; we define the support 
deviation of $X$ to be the smallest interestingness ratio 

\[
\mathrm{Support}(X) / \mathrm{EstimatedSupport}_{\hat{X}}(X)
\]

that occurs. If this support deviation is higher than the interestingness 
threshold, the itemset is called interesting. The frequent itemsets can be ordered with 
respect to this support deviation: the higher this ratio, the more 
interconnection occurs between the items involved. In fact, the assumption concerning the 
independence between lifted and non-lifted items clearly does not hold in that 
case, and an interesting connection is revealed. Of course it is also a possibility 
to look at overestimated supports; in many cases they are “complementary” 
to the underestimated ones. If necessary, the confidence can be used to turn the 
list of interesting itemsets into a list of interesting rules, further decreasing the 
number of interesting rules. Note that ancestors of frequent itemsets are 
automatically frequent, unless (as in [?]) different support thresholds are specified 
at different tree levels (if, e.g., {Milk, Bread} is frequent, {Dairy, Bread} should 
be frequent too in order to compute the support deviation). 
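
The final filtering step can be sketched as follows. The sketch assumes that 
the real support of an itemset and the list of estimates obtained from its 
first generation ancestor itemsets have already been computed; it covers only 
the underestimated case described above, and the function names and numbers 
are hypothetical. 

```python
def support_deviation(real_support, estimated_supports):
    """Smallest ratio real / estimated over all ancestor-based predictions."""
    return min(real_support / estimate for estimate in estimated_supports)


def is_interesting(real_support, estimated_supports, threshold):
    """True when even the best prediction is off by more than `threshold`."""
    if not estimated_supports:  # e.g. 1-itemsets: nothing predicts their support
        return True
    return support_deviation(real_support, estimated_supports) > threshold


# Hypothetical numbers: one real support and three ancestor-based estimates.
deviation = support_deviation(50.0, [40.0, 25.0, 30.0])
```

Because the supports are computed once and the thresholds are applied 
afterwards, the same lists can be re-filtered with different interestingness 
thresholds without rerunning Apriori. 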

The run time of the algorithms may - as usual - be long when the number of 
records is large and the minimum support threshold is low. In order to also get 
information on the bottom level, and not only on aggregate levels, this minimum 
support should be small enough. A run time of several hours was quite normal, 
most of it devoted to the computation of the frequent itemsets using the Apriori 
algorithm. Once the rules/itemsets are computed, it is however easy to deal 
with different interestingness thresholds. This is an advantage over methods that 
detect interestingness during the computation of the frequent itemsets (cf. [?], 
where no taxonomies are used). 

6 Experiments 

In order to get a feeling for the ideas, we first present some details for a simple 
database consisting of descriptions of 205 cars, see [?]. We have clearly dependent 
attributes like Price, MilesPerGallon, EngineSize and Horsepower (an integer 
between 48 and 288, the median being 96). This last attribute may be fuzzified 
as in Fig. 1, where it is split into four regions, denoted by Hp1, Hp2, Hp3 and 
Hp4. One might choose the regions in such a way that they all contain more or 
less the same number of records (option 1). Another option is to split the 
region simply into four equally large intervals (option 2). We also examined a 
random fuzzification (option 3). Of course there is quite a lot of freedom here, 
but an advantage of the fuzzy method is that slight changes do not lead to major 
differences (see, e.g., [?] for a comparison between crisp case and fuzzy case); as 
mentioned above, the interestingness corrects for different splittings to a certain 
extent. At aggregate levels we defined Hp12 and Hp34 as the “sum” of the first 
two regions and of the last two regions, respectively. In a similar way we also 
fuzzified the other attributes, all of them having four regions, region 1 
corresponding to a low value, and so on. Furthermore we added the attributes 
Doors2, Doors4 and Turbo, the last one being a boolean attribute. 

Clear dependencies were easily detected in all cases, such as Price4 and 
Hp4 (more than expected), Price4 and MilesPerGallon4 (less than expected, 
but still enough to meet the support threshold), and Price4 and 
MilesPerGallon1 (more than expected). But also itemsets like {Hp1, Price1, Doors2} were 
found to be interesting for option 1: all its eight parents (including those 
obtained by omitting an item) caused an interestingness ratio above 1.3. In Fig. 3 
some results with respect to the number of rules are presented. The itemset 
{Turbo, Hp34, Price34} had support deviation 1.61, indicating that turbo 
engine cars occur quite often among cars with high values for Horsepower and 
Price; but it also means that among expensive turbo engine cars those with a 
high value for Horsepower occur more than expected. The support threshold 
was chosen to be 10%. Here the notation 22 / 137 means that 22 out of 137 
itemsets are interesting. Note that option 2 leads to only 17 frequent 1-itemsets, 
due to the irregular distribution of the records over the equally sized intervals 
for the fuzzified attributes. We may conclude that a substantial reduction in the 
number of itemsets is obtained. 



option for     threshold for
fuzzification  support deviation  1-itemsets  2-itemsets  3-itemsets  4-itemsets

1              1.3                27 / 27     34 / 137    24 / 185    14 / 103
               1.4                27 / 27     28 / 137    14 / 185     5 / 103
               1.5                27 / 27     22 / 137    11 / 185     4 / 103
               1.6                27 / 27     15 / 137    11 / 185     3 / 103
               1.7                27 / 27      9 / 137     4 / 185     1 / 103
               1.8                27 / 27      4 / 137     1 / 185     0 / 103
               1.9                27 / 27      4 / 137     0 / 185     0 / 103
               2.0                27 / 27      4 / 137     0 / 185     0 / 103

2              1.3                17 / 17     15 / 87      7 / 173     2 / 148
               1.4                17 / 17     13 / 87      5 / 173     1 / 148
               1.5                17 / 17     10 / 87      4 / 173     1 / 148
               1.6                17 / 17      9 / 87      4 / 173     1 / 148
               1.7                17 / 17      8 / 87      4 / 173     1 / 148
               1.8                17 / 17      7 / 87      2 / 173     1 / 148
               1.9                17 / 17      5 / 87      2 / 173     1 / 148
               2.0                17 / 17      4 / 87      2 / 173     1 / 148

3              1.3                24 / 24     20 / 109    12 / 162     4 / 108
               1.4                24 / 24     14 / 109     5 / 162     1 / 108
               1.5                24 / 24     13 / 109     5 / 162     1 / 108
               1.6                24 / 24     12 / 109     5 / 162     1 / 108
               1.7                24 / 24     10 / 109     5 / 162     1 / 108
               1.8                24 / 24      8 / 109     5 / 162     1 / 108
               1.9                24 / 24      7 / 109     4 / 162     1 / 108
               2.0                24 / 24      4 / 109     1 / 162     0 / 108

Fig. 3. Car database: number of interesting itemsets out of all frequent itemsets, for 
different fuzzifications and thresholds for the support deviation 
Next we considered a much larger database, obtained from product and sales
information from supermarket chains. For every product, and every time period, 
and for every chain, the number of sales is given - among other things. We 
restricted ourselves to one chain. The database consisted of 158,301 records, 
giving sales for 4,059 products over a period of three years (split into 39 periods). 

We took minimum support 1%, leading to 163 frequent 1-itemsets, 378 fre- 
quent 2-itemsets and 102 frequent 3-itemsets. With a small interestingness thresh- 
old of 1.01, 162 2-itemsets were found to be interesting, and 46 3-itemsets. As in 
the previous example, some obvious itemsets were found quite easily. For exam- 
ple, {BrandX , SmallBag} and {Mayonnaise, Jar} were above expectation, with 
support deviations 3.50 and 2.85, respectively. Here Mayonnaise denotes a group 
of mayonnaise-like products, and BrandX consists of instant soups and sauces 
of a certain brand. The package clearly depends on the contents. Some interest- 
ing 3-itemsets were also discovered, for example {BBQSauce, Bottle, Chili} with 
support deviation 5.89. Apparently the Chili flavour in a bottle is even more
frequent than expected among the BBQ sauces.

It was much harder to find interesting itemsets containing time information,
because the support of itemsets containing, for example, Months was much
smaller by nature than that of the itemsets mentioned in the previous
paragraph. If one only
examines the records corresponding to one category of products, for instance the 
BBQ sauces (for our database 21,450 records), it is possible to detect small dif- 
ferences in sales throughout the year. It appeared that in the third quarter of 
the year high sales were more frequent than expected, whereas low sales were 
more frequent than expected in the first quarter of the year. 

Two important problems that arose are the following. Due to the fact that 
there were very many missing values in the database at hand, for some attributes 
it was hard to find a proper interpretation for the results. For the moment 
we chose not to skip the complete record; we ignored the missing fields when 
generating the frequent itemsets containing these fields. The second problem has 
to do with the fuzzifying process. In the database the number of products sold 
during some period in some supermarket chain is given. If one wants to fuzzify 
this number, one clearly has to take into account that notions like “many” or 
“few” severely depend on the product and the shop at hand. The usual data 
mining step that cleans the data has to be augmented with a process that handles 
this problem. For the current experiment we simply took global values, which 
seems justified because we deal with only one supermarket chain or even one 
category. 

7 Conclusions and Further Research 

We have presented a notion of interestingness for frequent itemsets in general
fixed format databases with quantitative, categorical and boolean
attributes, using fuzzy techniques. Examples show that the number of item- 
sets found decreases substantially when restricted to interesting ones. It is in 
principle also possible to handle important attributes like time. 
We would like to study this time dependency of itemsets further, for instance
using time windows. It should also be possible to use the notion of
interestingness for clustering techniques, cf. [?]. Other research issues are
the handling of
missing values and different fuzzifications. If for instance both price and sales 
attributes in a customer database are missing quite often, this might lead to an 
underestimated value for the support of itemsets containing both price and sales 
attributes. We would like to handle these kinds of problems, both in theoretical 
and practical respect. 

Another problem is the following: it is sometimes necessary to split one and 
the same fuzzy attribute (like the number of sales in the second experiment) 
differently, depending on the record or the group of records. For example,
sales of 1,000 units may be “many” for BrandX but “few” for BrandY. It would be
interesting to study different possibilities here, especially for practical situations. 
Finally we would like to get a better understanding of the missing value problem. 
Abstract. Data mining tries to discover interesting and surprising patterns
in a given data set. An important task is to develop effective measures of
interestingness for evaluating and ranking the discovered patterns. A good
measure should give a high rank to patterns that have strong evidence in the
data but are not too obvious. In this way the initial set of patterns can be
pruned before human inspection. In this paper we study interestingness
measures for generalized quantitative association rules, where the attribute
domains can be fuzzy. Several interestingness measures have been developed for
the discrete case, and it turns out that many of them can be generalized to
fuzzy association rules as well. More precisely, our goal is to compare the
fuzzy version of confidence to some other measures, which are based on
statistics and information theory. Our experiments show that although the
rankings of rules are relatively similar for most of the methods, some
anomalies also occur. Our suggestion is that the information-theoretic
measures are a good choice when estimating the interestingness of rules, both
for fuzzy and non-fuzzy domains.



1 Introduction 

Data mining, also referred to as knowledge discovery in databases, is concerned
with the nontrivial extraction of implicit, previously unknown, and potentially
useful information from data [?]. One major application domain of data mining
is the analysis of transactional data. The problem of mining boolean association
rules over basket data was first introduced in [?], and later broadened in [2],
for the case of databases consisting of categorical attributes alone.

For example, in a database maintained by a supermarket, an association rule 
might be of the form: 

“beer and potato chips → diapers (support: 2%, confidence: 73%)”,

which means that 2% of all database transactions contain the data items beer, 
potato chips and diapers, and 73% of the transactions that have the items “beer” 
and “potato chips” also have the item “diapers” in them. The two percentage 
values are referred to as support and confidence, respectively. 
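
The two definitions above can be made concrete with a minimal sketch. The baskets below are hypothetical toy data (they are not taken from the paper), and the function names `support` and `confidence` are chosen here purely for illustration.

```python
def support(transactions, items):
    """Fraction of transactions that contain every item in `items`."""
    items = set(items)
    return sum(items <= t for t in transactions) / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Support of antecedent and consequent together, relative to the antecedent."""
    both = set(antecedent) | set(consequent)
    return support(transactions, both) / support(transactions, antecedent)


# Hypothetical toy baskets, invented for this illustration.
baskets = [
    {"beer", "chips", "diapers"},
    {"beer", "chips"},
    {"beer", "chips", "diapers", "milk"},
    {"milk", "bread"},
    {"bread"},
]

rule_support = support(baskets, {"beer", "chips", "diapers"})
rule_confidence = confidence(baskets, {"beer", "chips"}, {"diapers"})
print(rule_support, rule_confidence)
```

Here two of the five baskets contain all three items, and two of the three baskets with beer and chips also contain diapers.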
In practice the information in many, if not most, databases is not limited to
categorical attributes (e.g. zip code, make of car), but also contains much
quantitative data (e.g. age, income). The problem of mining quantitative
association rules was introduced and an algorithm proposed in [?]. The
algorithm involves discretizing the domains of quantitative attributes into
intervals in order to reduce the domain into a categorical one. An example of
such an association might be “10% of married people between 50 and 70 have at
least 2 cars”.

Without a priori knowledge, however, determining the right intervals can
be a tricky and difficult task due to the “catch-22” situation, as it is called
in [?], because of the effects of small support and small confidence. Moreover,
these
intervals may not be concise and meaningful enough for human experts to easily 
obtain nontrivial knowledge from those rules discovered. 

Instead of using sharp intervals, fuzzy sets were suggested in [?] to represent
intervals with non-sharp boundaries. The obtained rules are called fuzzy
association rules. If meaningful linguistic terms are assigned to fuzzy sets,
the fuzzy association rule is more understandable. The above example could be
rephrased e.g. as “10% of married old people have several cars”. Algorithms for
mining fuzzy association rules were proposed in [?, ?], but the problem is that
an expert must provide the required fuzzy sets of the quantitative attributes
and their corresponding membership functions. It is unrealistic to assume that
experts can always provide the best fuzzy sets for fuzzy association rule
mining. In [?], we tackled this problem and proposed a method to find the fuzzy
sets for quantitative attributes by using clustering techniques.

It has been recognized that a discovery system can generate a large number
of patterns, most of which are of no interest to the user. To be able to prune
them, researchers have defined various measures of interestingness for patterns.
The most popular are confidence and support [?]; others include e.g. variance
and chi-square (correlation) [?], entropy gain [?], Laplace [5], and intensity
of implication [?]. Properties of various measures were analyzed in [?]. An
extensive survey of recently proposed interestingness measures is given in [?].

The term ‘interestingness’ is often used in a subjective sense, meaning the 
same as ‘surprisingness’. Here we take the view that it should be also measurable 
in more precise terms. Although many interestingness measures have been devel- 
oped for “discrete” domains, they are not directly applicable to other problem 
domains, such as fuzzy association rules. In this paper, we introduce generaliza- 
tions of interestingness measures for fuzzy association rules, based on statistics 
and information theory. In particular, we present two new measures using the
entropy concept. Our suggestion is that these information-theoretic measures
are a good choice when estimating the interestingness of rules, both for fuzzy
and non-fuzzy domains.

The rest of this paper is organized as follows. In the next section, we give 
a short summary of fuzzy association rules. Then we propose six measures for 
this fuzzy approach in Sect. 3. In Sect. 4 the experimental results are reported, 
comparing the proposed fuzzy interestingness measures. The paper ends with a 
brief conclusion in Sect. 5. 
2 Fuzzy Association Rules

Let $I = \{i_1, i_2, \ldots, i_m\}$ be the complete set of attributes where
each $i_j$ ($1 \le j \le m$) denotes a categorical or quantitative attribute.
Note that categories are a special case of quantitative attributes, and can be
handled similarly. In [?], we proposed a method to find the fuzzy sets for each
quantitative attribute by using clustering techniques. We defined the goodness
index G for clustering scheme evaluation, based on two criteria: compactness
and separation. The clustering process determines both the number ($c$) and
centers ($r_i$, $i = 1, \ldots, c$) of clusters. We divide the attribute
interval into $c$ sub-intervals around the cluster centers, with a coverage of
$p$ percent between two adjacent ones, and give each subinterval a symbolic
name related to its position (Fig. 1). The non-fuzzy partitioning is obtained
as a special case by setting $p$ to zero.




[Figure: attribute axis from MinValue to MaxValue, showing the cluster centers
r and the effective bounds d of the fuzzy sets (low), (middle), (high)]

Fig. 1. Example of the proposed fuzzy partitions 



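For fuzzy set $i$, $d_i^+$ means the effective upper bound, and is given by:

$$d_i^+ = r_{i+1} - \frac{1}{2} \cdot \frac{100 - p}{100} \cdot (r_{i+1} - r_i),$$

where $p$ is the overlap parameter in %, and $r_i$ is the center of cluster
$i$, $i = 1, 2, \ldots, c-1$.

Similarly, for fuzzy set $j$, $d_j^-$ means the effective lower bound, given by:

$$d_j^- = r_{j-1} + \frac{1}{2} \cdot \frac{100 - p}{100} \cdot (r_j - r_{j-1}),$$

where $j = 2, 3, \ldots, c$.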

Then, we generate the corresponding membership function for each fuzzy
set of a quantitative attribute; for formulas, see [?]. Finally, a new
transformed (fuzzy) database is generated from the original database by
applying the discovered fuzzy sets and the membership values.

Given a database with attributes $I$ and the fuzzy sets $F(i_j)$ associated
with attributes $i_j$ in $I$, we use the following form for a fuzzy
association rule [?]:
If $X = \{x_1, x_2, \ldots, x_p\}$ is $A = \{a_1, a_2, \ldots, a_p\}$
then $Y = \{y_1, y_2, \ldots, y_q\}$ is $B = \{b_1, b_2, \ldots, b_q\}$,

where $a_i \in F(x_i)$, $i = 1, \ldots, p$, and $b_j \in F(y_j)$,
$j = 1, \ldots, q$. $X$ and $Y$ are ordered subsets of $I$ and they are
disjoint, i.e. they share no common attributes. $A$ and $B$ contain the fuzzy
sets associated with the corresponding attributes in $X$ and $Y$. As in the
binary association rule, “$X$ is $A$” is called the antecedent of the rule
while “$Y$ is $B$” is called the consequent of the rule. We also denote
$Z = X \cup Y = \{z_1, \ldots, z_{p+q}\}$ and
$C = A \cup B = \{c_1, \ldots, c_{p+q}\}$.

3 Interestingness Measures for Fuzzy Association Rules 

One problem area in knowledge discovery is the development of interestingness 
measures for ranking the usefulness and utility of discovered patterns and rules. 
In this section, we first describe the fuzzy itemset measures, then we propose 
six other candidate measures for fuzzy association rules. Three of these six are 
based on traditional statistics, and their non-fuzzy counterparts have occurred
many times in the literature. The other three measures, on the other hand,
have their basis in information theory, originally developed by Shannon,
see e.g. [?]. Although our experiments in Sect. 4 do not show a big difference
between the two categories of methods, we conjecture that the information- 
theoretic measures are in some cases better in capturing dependences among 
data. Another classification of measures would be on the basis of linear/nominal 
scale, but our formulations are such that this separation need not be explicit. 

3.1 Fuzzy Itemset Measures 

Let $D = \{t_1, \ldots, t_n\}$ be a database, where $n$ denotes the total
number of records (‘transactions’). Let $(Z, C)$ be an attribute-fuzzy set
pair, where $Z$ is an ordered set of attributes $z_j$ and $C$ is a
corresponding set of fuzzy sets $c_j$. (From now on, we prefer to use the word
“itemset” instead of “attribute-fuzzy set pair” for $(Z, C)$ elements.) If a
fuzzy association rule $(X, A) \to (Y, B)$ is interesting, it should have
enough fuzzy support $FS_{(Z,C)}$ and a high fuzzy confidence value
$FC_{((X,A),(Y,B))}$, where $Z = X \cup Y$, $C = A \cup B$.

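The fuzzy support value is calculated by multiplying the membership grades
of each $(z_j, c_j)$ within a record, summing these products over all records,
then dividing the sum by the number of records. We prefer the product operator
as the fuzzy AND, instead of the normal minimum, because it better
distinguishes high- and low-support transactions.

$$FS_{(Z,C)} = \frac{\sum_{i=1}^{n} \prod_{j=1}^{m} t_i[(z_j, c_j)]}{n},$$

where $m$ is the number of items in itemset $(Z, C)$ and $t_i[(z_j, c_j)]$ is
the membership grade of attribute $z_j$ in fuzzy set $c_j$ for the $i$-th
record.

The fuzzy confidence value is calculated as follows:

$$FC_{((X,A),(Y,B))} = \frac{FS_{(Z,C)}}{FS_{(X,A)}}.$$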
Table 1. Part of database containing fuzzy membership

(Balance, low)   (Income, low)   (Credit, low)   (Credit, high)

0.2              0.2             0.1             0.9
0.4              0.2             0.1             0.9
0.9              0.8             0.7             0.3
0.9              0.8             0.9             0.1
0.6              0.4             0.9             0.1



Both of the above formulas are direct generalizations of the corresponding
formulas for the non-fuzzy case.

We shall use the following two rules to demonstrate the calculation of inter- 
estingness measures. The data behind the rules are presented in Table 1. 

Let $X = \{Balance, Income\}$, $A = \{low, low\}$, $Y = \{Credit\}$,
$B = \{low\}$, and $\bar{B} = \{high\}$. Rule 1, $(X, A) \to (Y, B)$, is given
by:

“If Balance is low and Income is low then Credit is low”,

and Rule 2, $(X, A) \to (Y, \bar{B})$, is phrased as:

“If Balance is low and Income is low then Credit is high”.

The consequents are thus complements of each other. From the table data, we
can easily see that Rule 1 should be classified as ‘valid’, whereas Rule 2
should not. When introducing the different measures of interestingness, we
will check how well they are able to confirm this observation.

Example 1. Fuzzy confidence does a good job in assessing the rules:

$$FC_{((X,A),(Y,B))} = \frac{0.004 + 0.008 + 0.504 + 0.648 + 0.216}{0.04 + 0.08 + 0.72 + 0.72 + 0.24} = 0.767,$$

$$FC_{((X,A),(Y,\bar{B}))} = \frac{0.036 + 0.072 + 0.216 + 0.072 + 0.024}{0.04 + 0.08 + 0.72 + 0.72 + 0.24} = 0.233.$$

For Rule 1, $FS_{(Y,B)} = 0.54$ and $FS_{(Z,C)} = 0.276$. Similarly,
$FS_{(Y,\bar{B})} = 0.46$ and $FS_{(Z,C)} = 0.084$ for Rule 2. In both cases,
$FS_{(X,A)} = 0.36$.
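
The figures in Example 1 can be reproduced with a short sketch. The membership grades are those of Table 1; the column indices and the helper names `fuzzy_support` and `fuzzy_confidence` are illustrative choices made here, not identifiers from the paper.

```python
from math import prod

# Membership grades from Table 1, one tuple per record.
# Columns: (Balance, low), (Income, low), (Credit, low), (Credit, high).
TABLE_1 = [
    (0.2, 0.2, 0.1, 0.9),
    (0.4, 0.2, 0.1, 0.9),
    (0.9, 0.8, 0.7, 0.3),
    (0.9, 0.8, 0.9, 0.1),
    (0.6, 0.4, 0.9, 0.1),
]


def fuzzy_support(records, columns):
    """Mean over records of the product (fuzzy AND) of the selected grades."""
    return sum(prod(r[c] for c in columns) for r in records) / len(records)


def fuzzy_confidence(records, antecedent, consequent):
    """Fuzzy support of the whole itemset divided by that of the antecedent."""
    whole = fuzzy_support(records, antecedent + consequent)
    return whole / fuzzy_support(records, antecedent)


X = (0, 1)         # Balance is low, Income is low
CREDIT_LOW = (2,)  # consequent of Rule 1
CREDIT_HIGH = (3,) # consequent of Rule 2

fs_x = fuzzy_support(TABLE_1, X)
fs_z_rule1 = fuzzy_support(TABLE_1, X + CREDIT_LOW)
fs_z_rule2 = fuzzy_support(TABLE_1, X + CREDIT_HIGH)
fc_rule1 = fuzzy_confidence(TABLE_1, X, CREDIT_LOW)
fc_rule2 = fuzzy_confidence(TABLE_1, X, CREDIT_HIGH)
print(fs_x, fs_z_rule1, fs_z_rule2, fc_rule1, fc_rule2)
```

Because the two consequents are complementary in Table 1, the two confidence values sum to 1.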



3.2 Fuzzy Covariance Measure

Covariance is one of the simplest measures of dependence, based on the
co-occurrence of the antecedent $(X,A)$ and consequent $(Y,B)$. If they
co-occur clearly more often than what can be expected in an independent case,
then the rule $(X,A) \to (Y,B)$ is potentially interesting. Piatetsky-Shapiro
called this measure a rule-interest function [?]. We extend it to the fuzzy
case, and define the covariance measure as:

$$Cov_{((X,A),(Y,B))} = FS_{(Z,C)} - FS_{(X,A)} \cdot FS_{(Y,B)}.$$

Example 2. The covariance measures for our sample rules are:

$Cov_{((X,A),(Y,B))} = 0.0816$, and $Cov_{((X,A),(Y,\bar{B}))} = -0.0816$.
3.3 Fuzzy Correlation Measure

Covariance has generally the drawback that it does not take distributions into
consideration. Therefore, in statistics, it is more common to use the so-called
correlation measure, where this drawback has been eliminated. Again, we have to
generalize the non-fuzzy formula to the fuzzy case, and obtain:

$$Corr_{((X,A),(Y,B))} = \frac{Cov_{((X,A),(Y,B))}}{\sqrt{Var_{(X,A)} \cdot Var_{(Y,B)}}},$$

where

$$Var_{(X,A)} = FS_{(X,A)^2} - \left(FS_{(X,A)}\right)^2, \qquad
FS_{(X,A)^2} = \frac{\sum_{i=1}^{n} \left(\prod_{x_j \in X} t_i[(x_j, a_j)]\right)^2}{n},$$

and similarly for $(Y,B)$.

These definitions are extensions of the basic formulas of variance and
covariance. The value of the fuzzy correlation ranges from -1 to 1. Only a
positive value tells that the antecedent and consequent are related. The higher
the value is, the more related they are.

Example 3. Again, applying the formula to our two sample rules, we obtain:
$Corr_{((X,A),(Y,B))} = 0.738$, and $Corr_{((X,A),(Y,\bar{B}))} = -0.738$.
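
As a check on Examples 2 and 3, the following sketch computes fuzzy covariance and correlation from the Table 1 grades. It assumes that the variance of an itemset is the mean of the squared per-record grade products minus the squared fuzzy support, which is the reading that reproduces the reported 0.738; the helper names are illustrative.

```python
from math import prod, sqrt

# Membership grades from Table 1, one tuple per record.
# Columns: (Balance, low), (Income, low), (Credit, low), (Credit, high).
TABLE_1 = [
    (0.2, 0.2, 0.1, 0.9),
    (0.4, 0.2, 0.1, 0.9),
    (0.9, 0.8, 0.7, 0.3),
    (0.9, 0.8, 0.9, 0.1),
    (0.6, 0.4, 0.9, 0.1),
]


def grades(records, columns):
    """Per-record product (fuzzy AND) of the selected membership grades."""
    return [prod(r[c] for c in columns) for r in records]


def mean(values):
    return sum(values) / len(values)


def fuzzy_covariance(records, antecedent, consequent):
    gx, gy = grades(records, antecedent), grades(records, consequent)
    # mean(gx * gy) is the fuzzy support of the combined itemset (Z, C).
    return mean([a * b for a, b in zip(gx, gy)]) - mean(gx) * mean(gy)


def fuzzy_variance(records, columns):
    g = grades(records, columns)
    return mean([v * v for v in g]) - mean(g) ** 2


def fuzzy_correlation(records, antecedent, consequent):
    cov = fuzzy_covariance(records, antecedent, consequent)
    return cov / sqrt(fuzzy_variance(records, antecedent)
                      * fuzzy_variance(records, consequent))


X, CREDIT_LOW, CREDIT_HIGH = (0, 1), (2,), (3,)
cov_rule1 = fuzzy_covariance(TABLE_1, X, CREDIT_LOW)
cov_rule2 = fuzzy_covariance(TABLE_1, X, CREDIT_HIGH)
corr_rule1 = fuzzy_correlation(TABLE_1, X, CREDIT_LOW)
corr_rule2 = fuzzy_correlation(TABLE_1, X, CREDIT_HIGH)
print(cov_rule1, cov_rule2, corr_rule1, corr_rule2)
```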



3.4 Fuzzy I-Measure 



As an example of a more ‘exotic’ probability-based interestingness measure, we
give the fuzzy version of a so-called I-measure suggested by Gray and Orlowska
[?]. Though it has some structural similarity with correlation, we regard it
rather as a heuristic measure. The fuzzy I-measure is defined as:

$$I_{((X,A),(Y,B))} = \left[ \left( \frac{FS_{(Z,C)}}{FS_{(X,A)} \cdot FS_{(Y,B)}} \right)^{k} - 1 \right] \cdot \left( FS_{(X,A)} \cdot FS_{(Y,B)} \right)^{m},$$

where $k$ and $m$ are weight parameters of the two terms. A practical problem in
applying this measure is the selection of these parameters.

Example 4. The I-measure values for our two sample rules (for $k = m = 2$) are:

$I_{((X,A),(Y,B))} = 0.038$, and $I_{((X,A),(Y,\bar{B}))} = -0.02$.
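
A minimal sketch of the I-measure follows, using the fuzzy supports quoted in Example 1 as inputs. The function name and the default weights `k = m = 2` mirror Example 4 and are otherwise an illustrative choice.

```python
def fuzzy_i_measure(fs_z, fs_x, fs_y, k=2, m=2):
    """((observed / expected)^k - 1) * expected^m, with expected = fs_x * fs_y."""
    expected = fs_x * fs_y
    return ((fs_z / expected) ** k - 1) * expected ** m


# Fuzzy supports taken from Example 1.
i_rule1 = fuzzy_i_measure(fs_z=0.276, fs_x=0.36, fs_y=0.54)
i_rule2 = fuzzy_i_measure(fs_z=0.084, fs_x=0.36, fs_y=0.46)
print(round(i_rule1, 3), round(i_rule2, 3))
```

With $k = m = 2$ the expression simplifies to $FS_{(Z,C)}^2 - (FS_{(X,A)} \cdot FS_{(Y,B)})^2$, which makes the sign behaviour easy to see: the value is positive exactly when the itemset is more frequent than expected under independence.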



In data mining, an association rule $X \to Y$ usually means that X implies Y
and we cannot assume Y also implies X. Covariance, Correlation and I -measure 
are symmetric with respect to (X,A) and (Y,B). Thus, we can use them only 
as non-directed measures of dependence. 
3.5 Fuzzy Unconditional Entropy (UE) Measure

Assume that we want to evaluate rule $(X, A) \to (Y, B)$, and denote
$Z = X \cup Y$ and $C = A \cup B$. For $(X, A)$, $(Y, B)$, and $(Z, C)$ we can
calculate the probability of occurrence, based on the transactions. These are
in fact the same as the (fuzzy) supports $FS_{(X,A)}$, $FS_{(Y,B)}$, and
$FS_{(Z,C)}$. If $(X, A)$ and $(Y, B)$ are independent, then it holds that

$$FS_{(Z,C)} = FS_{(X,A)} \cdot FS_{(Y,B)}.$$

However, if there is a dependence, then the equality does not hold. The degree
of correlation can be measured as follows. We determine the information amount
needed by assuming independence; we call this independence entropy, denoted
$H_{((X,A);(Y,B))}$ and computed as follows:

$$H_{((X,A);(Y,B))} = -FS_{(Z,C)} \cdot \log_2\left(FS_{(X,A)} \cdot FS_{(Y,B)}\right)
- \left(1 - FS_{(Z,C)}\right) \cdot \log_2\left(1 - FS_{(X,A)} \cdot FS_{(Y,B)}\right).$$

This represents the amount of information needed per transaction, when
using a (false) assumption of independence, applied when the true probability
is $FS_{(Z,C)}$. The true entropy of $(Z, C)$ is computed as follows:

$$H_{(Z,C)} = -FS_{(Z,C)} \cdot \log_2\left(FS_{(Z,C)}\right)
- \left(1 - FS_{(Z,C)}\right) \cdot \log_2\left(1 - FS_{(Z,C)}\right).$$

Since this formula uses precise probabilities, its value is always smaller than
or equal to the independence entropy. Moreover, their difference is larger when
the dependence is higher. Therefore, we get a good measure of correlation as the
difference, which we call unconditional entropy (UE):

$$UE_{((X,A),(Y,B))} = H_{((X,A);(Y,B))} - H_{(Z,C)}.$$

Notice that although the measure is always non-negative, the related
correlation can be either positive or negative, so that $(X, A)$ and $(Y, B)$
occur together either more or less frequently than in an independent case.
Therefore, the true consequent of the (interesting) rule can be either $(Y, B)$
or its complement. The latter holds if
$FS_{(Z,C)} < FS_{(X,A)} \cdot FS_{(Y,B)}$, i.e. covariance is $< 0$.

Example 5. Let us compute the UE-values for our two sample rules:

$UE_{((X,A),(Y,B))} = 0.878 - 0.850 = 0.028$, and
$UE_{((X,A),(Y,\bar{B}))} = 0.041$.

Although the latter value is higher than the former, the condition
$FS_{(Z,C)} = 0.084 < FS_{(X,A)} \cdot FS_{(Y,\bar{B})} = 0.36 \cdot 0.46 = 0.1656$
holds, and we conclude that Rule 2 is not valid. Instead, for Rule 1,
$FS_{(Z,C)} = 0.276 > FS_{(X,A)} \cdot FS_{(Y,B)} = 0.1944$, so Rule 1 is a
‘good’ one.
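
The UE computation of Example 5 can be sketched as below, again starting from the fuzzy supports of Example 1. The function names are illustrative, and the sketch assumes all probabilities lie strictly between 0 and 1 (it does not special-case the logarithm of zero).

```python
from math import log2


def binary_entropy(p):
    """Entropy in bits of a two-outcome distribution (p, 1 - p), 0 < p < 1."""
    return -p * log2(p) - (1 - p) * log2(1 - p)


def independence_entropy(fs_z, fs_x, fs_y):
    """Bits per transaction when coding with the independence estimate."""
    expected = fs_x * fs_y
    return -fs_z * log2(expected) - (1 - fs_z) * log2(1 - expected)


def unconditional_entropy(fs_z, fs_x, fs_y):
    return independence_entropy(fs_z, fs_x, fs_y) - binary_entropy(fs_z)


def is_positive_dependence(fs_z, fs_x, fs_y):
    """True when the itemset occurs more often than independence predicts."""
    return fs_z > fs_x * fs_y


ue_rule1 = unconditional_entropy(0.276, 0.36, 0.54)
ue_rule2 = unconditional_entropy(0.084, 0.36, 0.46)
print(round(ue_rule1, 3), is_positive_dependence(0.276, 0.36, 0.54))
print(round(ue_rule2, 3), is_positive_dependence(0.084, 0.36, 0.46))
```

The sign test is kept separate from the score because UE alone cannot tell whether the stated consequent or its complement is the supported one.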
3.6 Fuzzy Conditional Entropy (CE) Measure

The UE-measure is analogous to the correlation of a rule in the sense that both
formulas are symmetric with respect to $(X, A)$ and $(Y, B)$. We now develop
another measure, which makes a distinction between the antecedent and
consequent. Hereby we obtain an information-theoretic counterpart of
confidence. The reasoning resembles the derivation of UE, but from the
consequent’s point of view. The unconditional entropy of $(Y,B)$ is computed as
follows:

$$H_{(Y,B)} = -FS_{(Y,B)} \cdot \log_2\left(FS_{(Y,B)}\right)
- \left(1 - FS_{(Y,B)}\right) \cdot \log_2\left(1 - FS_{(Y,B)}\right).$$

If $(X, A)$ affects $(Y,B)$ in some way, then the conditional probability
$P_{((Y,B)|(X,A))}$ is different from $P_{(Y,B)}$. Notice that the conditional
probability is the same as the (fuzzy) confidence $FC_{((X,A),(Y,B))}$, defined
earlier. The conditional entropy is computed as

$$H_{((Y,B)|(X,A))} = -FC_{((X,A),(Y,B))} \cdot \log_2\left(FC_{((X,A),(Y,B))}\right)
- \left(1 - FC_{((X,A),(Y,B))}\right) \cdot \log_2\left(1 - FC_{((X,A),(Y,B))}\right).$$

Since the conditional entropy uses a more precise value for the probability
of $(Y,B)$ among the subset studied (transactions satisfying $(X, A)$), for
‘true’ rules, $H_{(Y,B)}$ should be larger than $H_{((Y,B)|(X,A))}$. Their
difference represents the deviation from the independent case, and measures the
dependence of $(Y, B)$ on $(X, A)$. The interestingness measure is thus defined
as

$$CE_{((X,A),(Y,B))} = H_{(Y,B)} - H_{((Y,B)|(X,A))}.$$

The larger the value, the higher the dependence. As for UE, also here the
actual consequent of a rule classified as interesting can be either $(Y, B)$ or
its complement. The latter holds if $FC_{((X,A),(Y,B))} < FS_{(Y,B)}$.

It should be noted that CE is similar to the Theil index [?], in the sense
that both measure deviation from the expected entropy.

Example 6. The CE-measure gives the same value for both of our sample rules,
because the consequents are complements of each other:

$CE_{((X,A),(Y,B))} = CE_{((X,A),(Y,\bar{B}))} = 0.995 - 0.784 = 0.211$.

This is just what should happen in the information-theoretic sense. Our
additional condition determines that Rule 1 is positive and Rule 2 is negative.
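
Example 6 can be verified with the following sketch, which takes the fuzzy supports of Example 1 and derives the confidence as $FS_{(Z,C)} / FS_{(X,A)}$. The function names are illustrative, and probabilities are assumed to lie strictly between 0 and 1.

```python
from math import log2


def binary_entropy(p):
    """Entropy in bits of a two-outcome distribution (p, 1 - p), 0 < p < 1."""
    return -p * log2(p) - (1 - p) * log2(1 - p)


def conditional_entropy_measure(fs_y, fc):
    """Entropy of the consequent minus its entropy given the antecedent."""
    return binary_entropy(fs_y) - binary_entropy(fc)


def supports_stated_consequent(fs_y, fc):
    """False when the complement of the consequent is the supported one."""
    return fc >= fs_y


FS_X = 0.36
fc_rule1 = 0.276 / FS_X   # fuzzy confidence of Rule 1
fc_rule2 = 0.084 / FS_X   # fuzzy confidence of Rule 2

ce_rule1 = conditional_entropy_measure(0.54, fc_rule1)
ce_rule2 = conditional_entropy_measure(0.46, fc_rule2)
print(round(ce_rule1, 3), supports_stated_consequent(0.54, fc_rule1))
print(round(ce_rule2, 3), supports_stated_consequent(0.46, fc_rule2))
```

Both rules receive the same score because binary entropy is symmetric in $p$ and $1 - p$; only the direction check separates them.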



3.7 Fuzzy J-measure

Information theory has naturally been applied to measuring interestingness of
rules before. One such measure, the so-called J-measure, was suggested by Smyth
and Goodman [?]. It can be generalized to the fuzzy rules as follows:

$$J_{((X,A),(Y,B))} = FS_{(Y,B)} \cdot \left[
\frac{FS_{(Z,C)}}{FS_{(Y,B)}} \cdot \log_2\left(\frac{FS_{(Z,C)}}{FS_{(Y,B)} \cdot FS_{(X,A)}}\right)
+ \left(1 - \frac{FS_{(Z,C)}}{FS_{(Y,B)}}\right) \cdot
\log_2\left(\frac{1 - \frac{FS_{(Z,C)}}{FS_{(Y,B)}}}{1 - FS_{(X,A)}}\right)
\right].$$

The first term $FS_{(Y,B)}$ measures the generality of the rule. The term
inside the square brackets measures the relative entropy of $(Y,B)$ on $(X,A)$,
the similarity of two probability (support) distributions. Though the J-measure
is different from our UE-measure, in experiments it gave rather similar
rankings to rules (see next section).

Example 7. The J-measures for our two test rules are:

$J_{((X,A),(Y,B))} = 0.037$, and $J_{((X,A),(Y,\bar{B}))} = 0.050$.
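
The values of Example 7 follow from the fuzzy supports of Example 1, as the sketch below shows. The function name is illustrative, and the sketch assumes all ratios lie strictly between 0 and 1 so that every logarithm is defined.

```python
from math import log2


def fuzzy_j_measure(fs_z, fs_x, fs_y):
    """FS_Y times the relative entropy between (p, 1 - p) and (fs_x, 1 - fs_x),
    where p = fs_z / fs_y."""
    p = fs_z / fs_y
    relative_entropy = (p * log2(p / fs_x)
                        + (1 - p) * log2((1 - p) / (1 - fs_x)))
    return fs_y * relative_entropy


j_rule1 = fuzzy_j_measure(fs_z=0.276, fs_x=0.36, fs_y=0.54)
j_rule2 = fuzzy_j_measure(fs_z=0.084, fs_x=0.36, fs_y=0.46)
print(round(j_rule1, 3), round(j_rule2, 3))
```

Like UE, the measure is non-negative for both rules, so the direction of the dependence still has to be checked separately.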



4 Experimental Results 

We assessed the effectiveness of our interestingness measures by experimenting
with a real-life dataset, which originates from research by the U.S. Census
Bureau. The data had 6 quantitative attributes. This database has been used in
previous data mining research [?, ?] and will not be described again here.

Using support threshold = 20% and confidence threshold = 50%, we get 
exactly 20 rules, which are evaluated in the tests. Table 2 and Table 3 describe 
the calculated interestingness and the assigned ranks, respectively, as determined 
by the corresponding interestingness measure. 

To quantify the extent of the ranking similarities between the seven measures,
we computed the correlation coefficient for each pair of interestingness
measures, see Table 4. The coefficients vary from a low of 0.243 for the pair
Conf and I-measure, to a high of 0.988 for the pair UE- and J-measure.

We found two distinct groups of measures, which are ranked similarly. One
group consists of the non-directed measures Cov, Corr, I-measure, UE-measure,
and J-measure. The other group consists of the directed measures CE and Conf.
However, there are no negative correlations between the two groups. Two
representatives from both groups are shown in Fig. 2(a) and Fig. 2(b).
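
The pairwise comparison can be sketched as follows for one pair of measures. The score columns are the UE-measure and J-measure values of Table 2; the use of Pearson's coefficient on the raw scores is an assumption made for this sketch, since the text says only that a correlation coefficient was computed from the scores.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long score lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# UE-measure and J-measure scores of rules 1..20, copied from Table 2.
UE = [0.0427, 0.0347, 0.0464, 0.0938, 0.0081, 0.0128, 0.0387, 0.0805,
      0.0060, 0.0049, 0.1005, 0.0146, 0.0090, 0.0144, 0.0001, 0.0046,
      0.0172, 0.0101, 0.0066, 0.0046]
J = [0.0524, 0.0435, 0.0580, 0.1558, 0.0102, 0.0171, 0.0519, 0.1425,
     0.0080, 0.0068, 0.1751, 0.0199, 0.0116, 0.0367, 0.0003, 0.0097,
     0.0200, 0.0108, 0.0070, 0.0050]

ue_vs_j = pearson(UE, J)
print(round(ue_vs_j, 3))
```

The same function can be applied to any other pair of columns from Table 2 to fill in the remaining cells of the comparison.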

At this point the reader may wonder what advantage information-theoretic
measures give over statistical ones, if any. The difference comes e.g. in cases
where support values are rather high. High support implies also a rather high
confidence, even in a case where the antecedent and consequent are independent.

However, CE gives a value ≈ 0 in this case, pruning the rule. That will happen
also with the ‘symmetric’ measures Cov, Corr, I-measure, and UE-measure, but
their drawback is lack of direction. Thus, our conjecture is that CE is a good
means of measuring the interestingness of rules.

5 Conclusion and Future Work 

In this paper we have studied interestingness measures for generalized quantita- 
tive association rules, where the attribute domains can be fuzzy. We compared 
Fig. 2. Rankings of interestingness measures - (a) group 1, (b) group 2



the fuzzy version of confidence to six other measures, three of which were
statistical, and the remaining three information-theoretic, based on the
entropy concept.

The experiments show that rankings of rules are relatively similar for most 
of the methods, but also several inversions appeared in the ranking order. Scores 
of interestingness measures were used to compute the correlation coefficients, 
revealing two categories of measures, the directed and non-directed ones. We 
suggest that the information-theoretic measures are a good choice when esti- 
mating the interestingness of rules, both for fuzzy and non-fuzzy domains. 

Here we compared the measures of interestingness only by means of numerical 
rankings, obtained from experiments. In the future, we plan to compare them 
Table 2. Interestingness measures - scores

Rule   Conf     Cov       Corr      I-measure   UE-measure   CE-measure   J-measure

 1     0.8384    0.0874    0.4522    0.0297      0.0427        0.3619      0.0524
 2     0.7841    0.0812    0.4018    0.0292      0.0347        0.2473      0.0435
 3     0.8418    0.0933    0.4728    0.0335      0.0464        0.3700      0.0580
 4     0.8504    0.1601    0.7232    0.0970      0.0938        0.3911      0.1558
 5     0.7067    0.0442    0.2102    0.0205      0.0081        0.1066      0.0102
 6     0.6416    0.0528    0.2450    0.0216      0.0128        0.0586      0.0171
 7     0.7655    0.0921    0.4273    0.0395      0.0387        0.2141      0.0519
 8     0.8022    0.1514    0.6704    0.0967      0.0805        0.2824      0.1425
 9     0.6804    0.0398    0.1809    0.0208      0.0060        0.0756      0.0080
10     0.5768    0.0338    0.1498    0.0150      0.0049        0.0170      0.0068
11     0.8521    0.1680    0.7560    0.1064      0.1005        0.3954      0.1751
12     0.6468    0.0576    0.2635    0.0249      0.0146        0.0630      0.0199
13     0.7093    0.0475    0.2231    0.0233      0.0090        0.1100      0.0116
14     0.5918    0.0682    0.3433    0.0522      0.0144        0.0244      0.0367
15     0.5117   -0.0064   -0.0316   -0.0046      0.0001       -0.0008      0.0003
16     0.6399    0.0394    0.2035    0.0340      0.0046        0.0369      0.0097
17     0.8148    0.0569    0.2923    0.0196      0.0172        0.2884      0.0200
18     0.9999    0.0474    0.3105    0.0198      0.0101        0.7274      0.0108
19     0.9679    0.0372    0.2675    0.0142      0.0066        0.5240      0.0070
20     0.9157    0.0340    0.2132    0.0166      0.0046        0.3118      0.0050



also by qualitative means. We also intend to use more diverse and extensive test 
data to confirm the claims made in this paper. 
Table 3. Interestingness measures - ranks

Rule   Conf   Cov   Corr   I-measure   UE-measure   CE-measure   J-measure

 1      7      6     5      8           5            6            5
 2     10      7     7      9           7           10            7
 3      6      4     4      7           4            5            4
 4      5      2     2      2           2            4            2
 5     13     14    16     14          14           13           14
...
Table 4. Correlation coefficients for interestingness measures

             Conf    Cov     Corr    I-measure   UE-measure   CE-measure   J-measure

Conf         -       0.378   0.500   0.243       0.359        0.949        0.300
Cov          0.378   -       0.987   0.945       0.975        0.372        0.967
Corr         0.500   0.987   -       0.915       0.957        0.495        0.940
I-measure    0.243   0.945   0.915   -           0.919        0.245        0.960
UE-measure   0.359   0.975   0.957   0.919       -            0.380        0.988
CE-measure   0.949   0.372   0.495   0.245       0.380        -            0.326
J-measure    0.300   0.967   0.940   0.960       0.988        0.326        -
Abstract. In the last years the availability of huge transactional and
experimental data sets and the arising requirements for data mining created
needs for clustering algorithms that scale and can be applied in diverse
domains. Thus, a variety of algorithms have been proposed which have
application in different fields
and may result in different partitioning of a data set, depending on the specific 
clustering criterion used. Moreover, since clustering is an unsupervised process, 
most of the algorithms are based on assumptions in order to define a partition- 
ing of a data set. It is then obvious that in most applications the final clustering 
scheme requires some sort of evaluation. 

In this paper we present a clustering validity procedure which, taking into account
the inherent features of a data set, evaluates the results of different clustering
algorithms applied to it. A validity index, S_Dbw, is defined according to well-known
clustering criteria so as to enable the selection of the algorithm providing the best
partitioning of a data set. We evaluate the reliability of our approach both
theoretically and experimentally, considering three representative clustering
algorithms run on synthetic and real data sets. It performed favorably in all studies,
giving an indication of the algorithm that is suitable for the considered application.



1 Introduction & Motivation 

Clustering is one of the most useful tasks in the data mining process for discovering
groups and identifying interesting distributions and patterns in the underlying data.
Thus, the main concern in the clustering process is to reveal the organization of
patterns into “sensible” groups, which allow us to discover similarities and
differences, as well as to derive useful inferences about them [7].

In the literature a wide variety of algorithms have been proposed for different
applications and sizes of data sets [14]. The application of an algorithm to a data set
aims at discovering its inherent partitions, assuming that the data set offers such a
clustering tendency. However, clustering is perceived as an unsupervised process, since
there are no predefined classes and no examples that would show what kind of desirable
relations should be valid among the data [2]. The various clustering algorithms are
therefore based on some assumptions in order to define a partitioning of a data set. As
a consequence, they may behave in a different way depending on: i) the features of the
data set (geometry and density distribution of clusters) and ii) the input parameter
values.

Fig. 1. A two-dimensional data set partitioned into (a) three and (b) four clusters
using K-Means, and (c) three clusters using DBSCAN

Fig. 2. The different partitions resulting from running DBSCAN with different input
parameter values (panels: Eps=2, Nps=4 and Eps=6, Nps=4)

Partitional algorithms such as K-means [2] are unable to handle noise and outliers,
and they are not suitable for discovering clusters with non-convex shapes. Moreover,
they are based on certain assumptions to partition a data set. They require the number
of clusters to be specified in advance, except for CLARANS [11], which needs as input
the maximum number of neighbors of a node as well as the number of local minima that
will be found in order to define a partitioning of a data set. Hierarchical algorithms
[9] proceed successively by either merging smaller clusters into larger ones or by
splitting larger clusters. The result of these algorithms is a tree of clusters.
Depending on the level at which we cut the tree, a different clustering of the data is
obtained. On the other hand, density-based [3, 4, 8] and grid-based algorithms [12]
suitably handle arbitrarily shaped collections of points (e.g. ellipsoidal, spiral,
cylindrical) as well as clusters of different sizes. Moreover, they can efficiently
separate noise (outliers). However, most of these algorithms are sensitive to some
input parameters and thus require careful selection of their values.

It is obvious from the above discussion that the final partition of a data set requires
some sort of evaluation in most applications [16]. An important issue in clustering is
then to find the number of clusters that gives the optimal partitioning (i.e., the
partitioning that best fits the real partitions of the data set). Though this is an
important problem that causes much discussion, the formal methods for finding the
optimal number of clusters in a data set are few [10, 13, 14, 15]. Moreover, no
clustering algorithm is efficient for all applications, which is why a diversity of
algorithms has been developed. Depending on the clustering criterion and the ability to
handle the special requirements of an application, a clustering algorithm can be
considered more efficient in a certain context (e.g. spatial data, business, medicine).
However, the issue of cluster validity is rather under-addressed in the area of
databases and data mining applications, and there are no efforts regarding the
evaluation of clustering schemes defined by different clustering algorithms.

Furthermore, in the experimental evaluations of most algorithms [1, 4, 6, 7, 8, 11],
2D data sets are used so that the reader is able to visually verify the validity of the
results (i.e., how well the clustering algorithm discovered the clusters of the data
set). It is clear that visualization of the data set is a crucial verification of the
clustering results. In the case of large multidimensional data sets (e.g. more than
three dimensions), effective visualization of the data set can be difficult. Moreover,
the perception of clusters using available visualization tools is a difficult task for
humans who are not accustomed to higher dimensional spaces.

Assuming that the data set includes distinct partitions (i.e., inherently supports 
clustering tendency), the above issues become very important. In the sequel, we show 
that different input parameters values of clustering algorithms may result in good or 
bad results in partitioning the data set. Moreover, it is clear that some algorithms may 
fail to discover the actual partitioning of a data set though the correct number of 
clusters is considered. 

For instance, in Fig. 1 and Fig. 2 we can see the way different algorithms (DBSCAN
[4], K-Means [2]) partition a data set for different input parameter values. Moreover,
it is clear from Fig. 1a that K-Means may partition the data set into the correct
number of clusters (i.e., three clusters) but in a wrong way. On the other hand,
DBSCAN (see Fig. 1c) is more efficient, since it partitions the data set into the
inherent three clusters given suitable input parameter values. Evidently, if there is
no visual perception of the clusters it is impossible to assess the validity of the
partitioning. It is important then to be able to choose the optimal partitioning of a
data set as a result of applying different algorithms with different input parameter
values.

What is then needed is a visual-aids-free assessment of some objective criterion,
indicating the validity of the results of applying a clustering algorithm to a
potentially high-dimensional data set. In this paper we propose an evaluation procedure
based on a cluster validity index (S_Dbw). Assuming a data set S, the index enables
the selection of the clustering algorithm and its input parameter values so as to result
in the best partitioning of S.

The remainder of the paper is organized as follows. In the next section we motivate 
and define the validity index, while in Section 3 we provide a theoretical evaluation of 
S_Dbw based on its definition. Then, in Section 4 an experimental evaluation of our 
approach for selecting the algorithm that gives the optimal partitioning of a data set is 
presented. For the experimental study both synthetic and real data sets are used. We
conclude in Section 5 by briefly presenting our contributions and indicating directions
for further research.



2 Selecting the Best Algorithm 

One of the most widely studied problems in the area of knowledge discovery is the
identification of clusters, i.e., dense regions in multidimensional data sets. This is
also the subject of cluster analysis. Clustering is perceived as a complicated problem,
since the concept of a cluster may differ depending on the application domain and the
features of the data. Thus, to satisfy the requirements of diverse application domains,
a multitude of clustering methods have been developed and are available in the
literature. However, the problem of selecting the algorithm that may result in the
partitioning that best fits a data set is under-addressed. In the sequel, we present our
approach and define the validity index based on which we may evaluate the clustering
schemes under consideration and select the one that best fits the data.

2.1 An Approach of Clustering Schemes’ Evaluation 

The criteria widely accepted for evaluating the partitioning of a data set are: i. the
separation of the clusters, and ii. their compactness. In general terms we want
clusters whose members have a high degree of similarity (or, in geometrical terms, are
close to each other) while we want the clusters themselves to be widely separated.

As we have already mentioned, there are cases in which an algorithm may falsely
partition a data set, whereas only specific values of the algorithm's input parameters
lead to an optimal partitioning of the data set. Here the term “optimal” implies
parameters that lead to partitions that are as close as possible (in terms of
similarity) to the actual partitions of the data set.

Our objective is then the definition of a procedure for assessing the quality of the
partitioning as defined by a number of clustering algorithms. More specifically, a
validity index is defined which, based on the features of the clusters, may evaluate the
resulting clustering schemes as defined by the algorithm under consideration. Then, the
algorithm and the respective set of input parameters resulting in the optimal
partitioning of a data set may be selected.

Let S be a data set and Alg = {alg_i | i = 1, ..., k} a set of widely known clustering
algorithms. P_alg_i denotes the set of input parameters of an algorithm alg_i. Applying
a clustering algorithm alg_i to S while its parameters take values in P_alg_i, a set of
clustering schemes is defined, say C_alg_i = {c_p_alg_i | p ∈ P_alg_i}. Then, the
clustering schemes are evaluated based on a validity index we define, S_Dbw. A list of
the index values, denoted as index, is maintained for each set C_alg_i defined by the
available algorithms. The main steps of our approach can be described as follows:

Step1. For all alg_i ∈ Alg
    Step1.1 Define the range of values for P_alg_i. Let [p_max, p_min]
    Step1.2 For p = p_min to p_max
                c_p_alg_i <- alg_i(p);
                nc <- number of clusters in c_p_alg_i;
                index[alg_i].add(S_Dbw(nc));
            End for
        End for

Step2. index_opt <- min over alg_i ∈ Alg of {index[alg_i]};
       opt_cl <- c_p_alg_i with S_Dbw value equal to index_opt;
       best_alg_i <- alg_i resulting in opt_cl;

In the sequel, we discuss in more detail the validity index which our approach uses in
order to select the clustering algorithm resulting in the optimal partitioning of a data set.
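
The two steps of the procedure can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `select_best_scheme`, the toy algorithms `threshold_split` and `modulo_split`, and the stand-in index `within_ss` are all hypothetical. Any validity index that is to be minimized, such as S_Dbw, could be passed in place of `within_ss`.

```python
def select_best_scheme(data, algorithms, param_grid, validity_index):
    """Return (algorithm name, parameter, labels, index value) with minimal index."""
    best = None
    for name, run in algorithms.items():          # Step 1: every algorithm alg_i
        for p in param_grid[name]:                # Step 1.2: every parameter value p
            labels = run(data, p)                 # clustering scheme c_p_alg_i
            score = validity_index(data, labels)  # index value for this scheme
            if best is None or score < best[3]:   # Step 2: keep the minimum
                best = (name, p, labels, score)
    if best is None:
        raise ValueError("no clustering scheme was produced")
    return best


# --- Hypothetical toy setup, for illustration only ---
def threshold_split(data, t):
    """Toy 'algorithm': two clusters split at threshold t on the first coordinate."""
    return [0 if x[0] < t else 1 for x in data]


def modulo_split(data, k):
    """Toy 'algorithm' that ignores geometry and assigns points round-robin."""
    return [i % k for i in range(len(data))]


def within_ss(data, labels):
    """Stand-in validity index: within-cluster sum of squares (lower is better)."""
    groups = {}
    for x, lab in zip(data, labels):
        groups.setdefault(lab, []).append(x[0])
    total = 0.0
    for vals in groups.values():
        mean = sum(vals) / len(vals)
        total += sum((v - mean) ** 2 for v in vals)
    return total


data = [(1.0,), (2.0,), (3.0,), (8.0,), (9.0,), (10.0,)]
algorithms = {"threshold_split": threshold_split, "modulo_split": modulo_split}
param_grid = {"threshold_split": [2.5, 5.0, 9.5], "modulo_split": [2, 3]}

best_name, best_param, best_labels, best_score = select_best_scheme(
    data, algorithms, param_grid, within_ss
)
print(best_name, best_param, best_labels, best_score)
```

Keeping the index as a parameter mirrors the procedure itself, which is independent of the particular algorithms being compared.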



2.2 Validity Index Definition 

We define our validity index, S_Dbw, by combining both clustering criteria (compactness
and separation) while also taking density into account. In the following we formalize
our clustering validity index based on: i. clusters’ compactness (in terms of
intra-cluster variance), and ii. density between clusters (in terms of inter-cluster
density).

Let $D = \{v_i \mid i = 1, \ldots, c\}$ be a partitioning of a data set S into c clusters, where $v_i$ is the
center of the i-th cluster as it results from applying a clustering algorithm $alg_j$ to S.

Let stdev be the average standard deviation of the clusters, defined as:

$$stdev = \frac{1}{c} \sqrt{\sum_{i=1}^{c} \|\sigma(v_i)\|}$$

Here the term $\|x\|$ is defined as $\|x\| = (x^T x)^{1/2}$, where x is a vector of dimension d.

Then the overall inter-cluster density is defined as follows.

Definition 1. Inter-cluster Density (ID) - It evaluates the average density in the
region among clusters in relation to the density of the clusters. The goal is for the
density among clusters to be significantly low in comparison with the density in the
considered clusters. Then, we can define inter-cluster density as follows:

$$Dens\_bw(c) = \frac{1}{c \cdot (c-1)} \sum_{i=1}^{c} \left( \sum_{j=1,\, j \neq i}^{c} \frac{density(u_{ij})}{\max\{density(v_i),\, density(v_j)\}} \right) \qquad (1)$$

where $v_i$, $v_j$ are the centers of clusters $c_i$, $c_j$, respectively, and $u_{ij}$ is the middle
point of the line segment defined by the clusters’ centers $v_i$, $v_j$. The term
density(u) is given by equation (2):

$$density(u) = \sum_{l=1}^{n_{ij}} f(x_l, u) \qquad (2)$$

where $n_{ij}$ is the number of tuples that belong to the clusters $c_i$ and $c_j$, i.e.,
$x_l \in c_i \cup c_j \subseteq S$. The term density(u) represents the number of points in the
neighbourhood of u. In our work, the neighbourhood of a data point, u, is defined to be
a hypersphere with center u and radius the average standard deviation of the clusters,
stdev. More specifically, the function f(x, u) is defined as:

$$f(x, u) = \begin{cases} 0, & \text{if } d(x, u) > stdev \\ 1, & \text{otherwise} \end{cases} \qquad (3)$$

It is obvious that a point belongs to the neighborhood of u if its distance from u is
no greater than the average standard deviation of the clusters. Here we assume that the
data have been scaled (bringing all dimensions into comparable ranges) so that all
dimensions are considered equally important when finding the neighbors of a
multidimensional point [2].

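Definition 2. Intra-cluster variance - Average scattering for clusters. The average
scattering for clusters is defined as:

$$Scat(c) = \frac{1}{c} \sum_{i=1}^{c} \frac{\|\sigma(v_i)\|}{\|\sigma(S)\|} \qquad (4)$$

The term $\sigma(S)$ is the variance of a data set; its p-th dimension is defined as follows:

$$\sigma_x^p = \frac{1}{n} \sum_{k=1}^{n} \left( x_k^p - \bar{x}^p \right)^2$$

where $\bar{x}^p$ is the p-th dimension of

$$\bar{X} = \frac{1}{n} \sum_{k=1}^{n} x_k, \quad \forall x_k \in S$$

The term $\sigma(v_i)$ is the variance of cluster $c_i$; its p-th dimension is given by

$$\sigma_{v_i}^p = \frac{1}{n_i} \sum_{k=1}^{n_i} \left( x_k^p - v_i^p \right)^2$$

where $n_i$ is the number of points in cluster $c_i$.

Then the validity index S_Dbw is defined as:

$$S\_Dbw = Scat(c) + Dens\_bw(c) \qquad (5)$$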

The definition of S_Dbw indicates that both criteria of “good” clustering (i.e.,
compactness and separation) are properly combined, enabling reliable evaluation of
clustering results. The first term of S_Dbw, Scat(c), indicates the average scattering
within the c clusters. A small value of this term is an indication of compact clusters.
As the scattering within clusters increases (i.e., they become less compact), the value
of Scat(c) also increases, and therefore it is a good indication of the compactness of
clusters. Dens_bw(c) indicates the average number of points between the c clusters
(i.e., an indication of inter-cluster density) in relation to the density within
clusters. A small Dens_bw(c) value indicates well-separated clusters. The number of
clusters, c, that minimizes the above index can be considered as an optimal value for
the number of clusters present in the data set.
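
To make the definition concrete, the following is a minimal pure-Python sketch of S_Dbw. It assumes the commonly published form of the index: stdev = (1/c) sqrt(sum_i ||sigma(v_i)||), Scat(c) as the mean of ||sigma(v_i)|| / ||sigma(S)||, and Dens_bw(c) as the average, over ordered pairs of distinct clusters, of the neighbourhood count at the midpoint of the two centers divided by the larger of the counts at the two centers. The function names are hypothetical, and skipping a pair whose denominator is zero is an assumption of this sketch rather than something the text specifies.

```python
import math


def _mean(points):
    n, d = len(points), len(points[0])
    return [sum(p[k] for p in points) / n for k in range(d)]


def _variance_vector(points, center):
    # p-th component: (1/n) * sum_k (x_k^p - center^p)^2
    n = len(points)
    return [sum((p[k] - center[k]) ** 2 for p in points) / n
            for k in range(len(center))]


def _norm(v):
    return math.sqrt(sum(x * x for x in v))


def _count_near(points, u, radius):
    # density(u): number of points x with d(x, u) <= radius, i.e. f(x, u) = 1
    return sum(
        1 for p in points
        if math.sqrt(sum((a - b) ** 2 for a, b in zip(p, u))) <= radius
    )


def s_dbw(data, labels):
    """Sketch of S_Dbw = Scat(c) + Dens_bw(c) for a labelled data set."""
    clusters = {}
    for p, lab in zip(data, labels):
        clusters.setdefault(lab, []).append(p)
    groups = list(clusters.values())
    c = len(groups)
    if c < 2:
        raise ValueError("S_Dbw needs at least two clusters")

    centers = [_mean(g) for g in groups]
    sigma_norms = [_norm(_variance_vector(g, v)) for g, v in zip(groups, centers)]
    sigma_s = _norm(_variance_vector(data, _mean(data)))
    if sigma_s == 0:
        raise ValueError("data set has zero variance")

    scat = sum(sigma_norms) / (c * sigma_s)
    stdev = math.sqrt(sum(sigma_norms)) / c

    dens = 0.0
    for i in range(c):
        for j in range(c):
            if i == j:
                continue
            pair = groups[i] + groups[j]          # points of c_i and c_j
            u = [(a + b) / 2 for a, b in zip(centers[i], centers[j])]
            denom = max(_count_near(pair, centers[i], stdev),
                        _count_near(pair, centers[j], stdev))
            if denom > 0:                         # assumption: skip empty neighbourhoods
                dens += _count_near(pair, u, stdev) / denom
    dens_bw = dens / (c * (c - 1))
    return scat + dens_bw


# Two compact, well-separated groups of five points each (made-up data).
data = [(0, 0), (0, 1), (1, 0), (1, 1), (0.5, 0.5),
        (10, 10), (10, 11), (11, 10), (11, 11), (10.5, 10.5)]
good_score = s_dbw(data, [0] * 5 + [1] * 5)   # labels follow the two groups
bad_score = s_dbw(data, [0, 1] * 5)           # labels mix the two groups
print(good_score < bad_score)
```

On this made-up data the labelling that follows the two groups receives the lower index value, which matches the reading of the index given above: smaller values indicate more compact and better separated clusters.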
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Fig. 3. A data set S partitioned in three (a) two (h) and four clusters (c, d) 



3 Integrity Issues 

In this section we evaluate the integrity of the validity index S_Dbw, on which our
approach is based, as regards its ability to select the best partitioning among those
proposed by the clustering algorithms. In the following lemmas we summarize the
different cases of clustering results, giving also their respective proof sketches.

Let S be a data set containing convex clusters (as in Fig. 3a) and consider various
ways to partition it using different clustering algorithms (Fig. 3b-d). Let the optimal
(natural) partitioning of data set S (as it appears in Fig. 3a) consist of three
clusters. The number of clusters as it emanates from the optimal partitioning is
further called the “correct number of clusters”. We assume that the data set is evenly
distributed, i.e., on average a similar number of data points is found for each surface
unit in the clusters.

Lemma 1: Assume a data set S with convex clusters and a clustering algorithm A
applied repetitively to S, each time with different input parameter values $P_i$,
resulting in different partitions $D_i$ of S. The value of S_Dbw is minimized when the
correct number of clusters is found.

Proof: Let n be the correct number of clusters of the data set S corresponding to the
partitioning $D_1$ (optimal partitioning of S): $D_1(n, S) = \{c_{D_1 i}\}$, $i = 1, \ldots, n$,
and m the number of clusters of another partitioning $D_2$ of the same data set:
$D_2(m, S) = \{c_{D_2 j}\}$, $j = 1, \ldots, m$.

Let $S\_Dbw_{D_1}$ and $S\_Dbw_{D_2}$ be the values of the validity index for the respective
partitioning schemes. Then, we consider the following cases:

i) Assume $D_2$ to be a partitioning where more than the actual clusters are formed
(i.e., m > n). Moreover, parts of the actual clusters (corresponding to $D_1$) are
grouped into clusters of $D_2$ (as in Fig. 3d). Let
$fC_{D_1} = \{fc_{D_1 p} \mid p = 1, \ldots, nfr_1,\ fc_{D_1 p} \subseteq c_{D_1 i},\ i = 1, \ldots, n\}$ be a set of
fractions of clusters in $D_1$. Similarly, we define
$fC_{D_2} = \{fc_{D_2 k} \mid k = 1, \ldots, nfr_2,\ fc_{D_2 k} \subseteq c_{D_2 j},\ j = 1, \ldots, m\}$. Then:

a) $\exists\, c_{D_2 j}: c_{D_2 j} = \bigcup fc_{D_1 p}$, where $p = p_1, \ldots, p_n$, $p_1 \geq 1$ and
$p_n \leq nfr_1$, where $nfr_1$ is the number of considered fractions of clusters in $D_1$,

b) $\exists\, c_{D_1 i}: c_{D_1 i} = \bigcup fc_{D_2 k}$, where $k = k_1, \ldots, k_n$, $k_1 \geq 1$ and
$k_n \leq nfr_2$, where $nfr_2$ is the number of considered fractions of clusters in $D_2$.

In this case, some of the clusters in $D_2$ include regions of low density (for instance
cluster 3 in Fig. 3d). Thus, the value of the first term of the index, related to
intra-cluster variance, increases for $D_2$ as compared to the intra-cluster variance of
$D_1$ (i.e., Scat(m) > Scat(n)). On the other hand, the second term (inter-cluster
density) also increases as compared to the corresponding term of the index for $D_1$
(i.e., Dens_bw(m) > Dens_bw(n)). This is because some of the clusters in $D_1$ are split
and therefore there are border areas between clusters that are of high density (e.g.,
clusters 1 and 3 in Fig. 3d). Then, since both S_Dbw terms for the $D_2$ partitioning
increase, we conclude that $S\_Dbw_{D_1} < S\_Dbw_{D_2}$.

ii) Let $D_2$ be a partitioning where more clusters than in $D_1$ are formed (i.e., m > n).
Also, we assume that at least one of the clusters in $D_1$ is split into more than one in
$D_2$, while no parts of $D_1$ clusters are grouped into $D_2$ clusters (as in Fig. 3c),
i.e., $\exists\, c_{D_1 i}: c_{D_1 i} = \bigcup c_{D_2 j}$, $j = k_1, \ldots, k$ and $k_1 \geq 1$, $k \leq m$. In this
case, the value of the first term of the index, related to intra-cluster variance,
slightly decreases compared to the corresponding term of $D_1$, since the clusters in
$D_2$ are more compact. As a consequence, Scat(m) <= Scat(n). On the other hand, the
second term (inter-cluster density) increases, as some of the clusters in $D_1$ are split
and therefore there are borders between clusters that are of high density (for instance
clusters 1 and 3 in Fig. 3c). Then Dens_bw(m) >> Dens_bw(n). Based on the above
discussion, and taking into account that the increase of inter-cluster density is
significantly higher than the decrease of intra-cluster variance, we may conclude that
$S\_Dbw_{D_1} < S\_Dbw_{D_2}$.

iii) Let $D_2$ be a partitioning with fewer clusters than in $D_1$ (m < n), where two or
more of the clusters in $D_1$ are grouped into a cluster in $D_2$ (as in Fig. 3b). Then
$\exists\, c_{D_2 j}: c_{D_2 j} = \bigcup c_{D_1 i}$, where $i = p_1, \ldots, p$ and $p_1 \geq 1$, $p \leq n$. In this
case, the value of the first term of the index, related to intra-cluster variance,
increases as compared to the value of the corresponding term of $D_1$, since the clusters
in $D_2$ contain regions of low density. As a consequence, Scat(m) >> Scat(n). On the
other hand, the second term of the index (inter-cluster density) slightly decreases or
remains roughly the same as compared to the corresponding term of $D_1$ (i.e.,
Dens_bw(n) ≈ Dens_bw(m)). This is because, similarly to the case of the $D_1$
partitioning (Fig. 3a), there are no borders between clusters in $D_2$ that are of high
density. Then, based on the above discussion and considering that the increase of
intra-cluster variance is significantly higher than the decrease of inter-cluster
density, we may conclude that $S\_Dbw_{D_1} < S\_Dbw_{D_2}$.

Fig. 4. A data set S partitioned correctly (a) and falsely (b) in three clusters

Lemma 2: Assume a data set S containing convex clusters and a clustering algorithm
A applied repetitively to S, each time with different parameter values $P_i$, resulting
in different partitions $D_i$ of S. For each $D_i$ it is true that the correct number of
clusters is found. The value of S_Dbw is minimized when the optimal partitions are
found for the correct number of clusters.

Proof: We consider $D_2$ to be a partitioning with the same number of clusters as the
optimal one, $D_1$ (Fig. 4a), i.e., m = n. Furthermore, we assume that one or more of
the actual clusters corresponding to $D_1$ are split and their parts are grouped into
different clusters in $D_2$ (as in Fig. 4b). That is, if
$fC_{D_1} = \{fc_{D_1 p} \mid p = 1, \ldots, nfr_1,\ fc_{D_1 p} \subseteq c_{D_1 i},\ i = 1, \ldots, n\}$ is a set of
cluster fractions in $D_1$, then $\exists\, c_{D_2 j}: c_{D_2 j} = \bigcup fc_{D_1 i}$, $i = p_1, \ldots, p$ and
$p_1 \geq 1$, $p \leq n$. In this case, the clusters in $D_2$ contain regions of low density
(e.g. cluster 1 in Fig. 4b) and, as a consequence, the value of the first term of the
index, intra-cluster variance, increases as compared to the corresponding term of
$D_1$, i.e., Scat(m) > Scat(n). On the other hand, some of the clusters in $D_1$ are split
and therefore there are border areas between clusters that are of high density (for
instance clusters 1, 3 and 1, 2 in Fig. 4b). Therefore, the second term (inter-cluster
density) of $D_2$ also increases as compared to that of $D_1$, i.e.,
Dens_bw(m) > Dens_bw(n). Based on the above discussion it is obvious that
$S\_Dbw_{D_1} < S\_Dbw_{D_2}$.

3.1 Time Complexity 

The complexity of our approach depends on the complexity of the clustering algorithms
used for defining the clustering schemes and on the complexity of the validity index
S_Dbw. More specifically, assume a set of clustering algorithms, Alg = {alg_1, ...,
alg_k}. Considering the complexity of each of the algorithms, O(alg_i) (where i = 1,
..., k), we may define the complexity of the whole clustering process, i.e., the process
of defining the clustering schemes based on the algorithms under consideration, say
O(Alg). Moreover, the complexity of the index is based on its two terms as defined in
(1) and (4). Assume d is the number of attributes (data set dimension), c is the number
of clusters, and n is the number of database tuples. The intra-cluster variance
complexity is O(ndc), while the complexity of inter-cluster density is O(ndc^2). Then
the complexity of S_Dbw is O(ndc^2). Usually c, d << n, therefore the complexity of our
index for a specific clustering scheme is O(n). Finally, the complexity of the whole
procedure for finding the clustering scheme that best fits a data set S among those
proposed by the k algorithms, and as a consequence for finding the best clustering
algorithm, is O(O(Alg) + n).



4 Experimental Evaluation 

In this section, the proposed approach for selecting the clustering algorithm that
results in the optimal clustering scheme for a data set is experimentally tested. In
our study, we consider three well-known algorithms of different clustering categories:
partitional, density-based and hierarchical (K-means, DBSCAN and CURE respectively).
Also, we experiment with real and synthetic multidimensional data sets containing
different numbers of clusters. In the sequel, due to lack of space, we present only
some representative examples of our experimental study.

We consider two 2-dimensional data sets containing four and seven clusters
respectively, as Fig. 5 and Fig. 6 depict. Applying the above mentioned clustering
algorithms (i.e., K-means, DBSCAN and CURE) to these data sets, three sets of
clustering schemes are produced. Each set of clustering schemes corresponds to the
clustering results of an algorithm for different values of its input parameters. Then
we evaluate the defined clustering schemes based on the proposed index, S_Dbw. The
clustering scheme at which S_Dbw is minimized indicates the best partitioning of the
data set and, as a consequence, the best algorithm and the input parameter values
resulting in it.




Fig. 5. DataSet1 - A synthetic data set containing four clusters
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Fig. 6. DataSet2 - A partitioning of data set into seven clusters using (a) CURE or DBSCAN, 
(b) K-Means 





Table 1 and Table 2 summarize the values of S_Dbw in each of the above cases.
More specifically, Table 1 shows that S_Dbw takes its minimum value when DataSet1
is partitioned into four clusters (i.e., the actual clusters in the data set) no matter
which algorithm is used. This means that according to our approach the optimal
partitioning of DataSet1 is four clusters and all three algorithms partition it in the
right way (i.e., they find the actual clusters in the data set). Figure 5 is a
visualization of DataSet1, while the circles indicate the proposed partitioning.




A Data Set Oriented Approach for Clustering Algorithm Selection 






Table 1. The values of S_Dbw for DataSet1 clustering schemes

No. of clusters | K-means input | K-means S_Dbw | DBSCAN input     | DBSCAN S_Dbw | CURE input (r=10, a=0.3) | CURE S_Dbw
8               | C=8           | 0.124         | -                | -            | C=8                      | 0.108
7               | C=7           | 0.118         | -                | -            | C=7                      | 0.103
6               | C=6           | 0.0712        | Eps=20 MinPts=4  | 0.087        | C=6                      | 0.082
5               | C=5           | 0.086         | Eps=10 MinPts=10 | 0.086        | C=5                      | 0.091
4               | C=4           | 0.0104        | Eps=40 MinPts=10 | 0.0104       | C=4                      | 0.0104
3               | C=3           | 0.0312        | Eps=10 MinPts=15 | 0.031        | C=3                      | 0.031
2               | C=2           | 0.1262        | Eps=20 MinPts=15 | 0.1262       | C=2                      | 0.126



Table 2. The values of S_Dbw for DataSet2 clustering schemes

No. of clusters | K-means input | K-means S_Dbw | DBSCAN input    | DBSCAN S_Dbw | CURE input (r=10, a=0.3) | CURE S_Dbw
8               | C=8           | 0.66          | Eps=8 MinPts=10 | 0.0333       | C=8                      | 0.0517
7               | C=7           | 0.6004        | Eps=20 MinPts=4 | 0.0009       | C=7                      | 0.0009
6               | C=6           | 0.575         | Eps=80 MinPts=4 | 0.0018       | C=6                      | 0.0019
5               | C=5           | 0.491         | -               | -            | C=5                      | 0.0051
4               | C=4           | 0.365         | -               | -            | C=4                      | 0.032
3               | C=3           | 0.045         | -               | -            | C=3                      | 0.073
2               | C=2           | 0.854         | -               | -            | C=2                      | 0.796



Moreover, according to Table 2, in the case of DataSet2, S_Dbw takes its minimum
value for the partitioning into seven clusters defined by DBSCAN and CURE. This is
also the number of actual clusters in the data set. Fig. 6a presents the partitioning of
DataSet2 into seven clusters as defined by DBSCAN and CURE, while the clustering
result of K-Means into seven clusters is presented in Fig. 6b. It is obvious that
K-Means fails to partition DataSet2 properly even when the correct number of clusters
(i.e., c=7) is considered.
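
Step 2 of the selection procedure can be checked directly against the values reported in Table 2. The short sketch below copies those values into a dictionary (schemes marked "-" in the table are omitted) and picks the minimum per algorithm and overall; the variable names are illustrative only.

```python
# S_Dbw values from Table 2, keyed by algorithm and then by number of clusters.
table2 = {
    "K-means": {8: 0.66, 7: 0.6004, 6: 0.575, 5: 0.491, 4: 0.365, 3: 0.045, 2: 0.854},
    "DBSCAN": {8: 0.0333, 7: 0.0009, 6: 0.0018},
    "CURE": {8: 0.0517, 7: 0.0009, 6: 0.0019, 5: 0.0051, 4: 0.032, 3: 0.073, 2: 0.796},
}

# Best scheme per algorithm: (number of clusters, S_Dbw value) with minimal S_Dbw.
best_per_alg = {
    alg: min(values.items(), key=lambda kv: kv[1]) for alg, values in table2.items()
}

# Overall optimum (index_opt) and the algorithms that reach it.
index_opt = min(value for _, value in best_per_alg.values())
winners = sorted(alg for alg, (_, value) in best_per_alg.items() if value == index_opt)

for alg, (n_clusters, value) in best_per_alg.items():
    print(alg, n_clusters, value)
print("index_opt =", index_opt, "reached by", winners)
```

The overall minimum is shared by DBSCAN and CURE at seven clusters, in line with the discussion above, while the best K-means value in the table is attained at a different number of clusters.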

The value of our approach is more evident in the case of multidimensional data sets,
where efficient visualization is difficult or even impossible. We consider two
synthetic data sets, four- and six-dimensional (further referred to as DataSet3 and
DataSet4 respectively). There are four clusters in DataSet3, while the number of
clusters in DataSet4 is two. For DataSet3 our approach proposes the partitioning into
four clusters defined by DBSCAN and CURE as the one best fitting the data under
consideration. Four clusters is the value at which S_Dbw takes its minimum value
(see Table 3) and it is also the actual number of clusters in the data set. Similarly,
in the case of DataSet4, S_Dbw takes its minimum value for the partitioning into two
clusters defined by DBSCAN (see Table 4). Considering the clustering schemes proposed
by CURE, S_Dbw is minimized when c=5. Thus, the best partitioning among those defined
by CURE is five clusters. CURE evidently fails to partition DataSet4 in the right way,
even when the actual number of clusters (i.e., two) is considered.
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Table 3. The values of S_Dbw for 4D-data set clustering schemes 







No. of clusters | K-means input | K-means S_Dbw        | DBSCAN input     | DBSCAN S_Dbw | CURE input (r=10, a=0.3) | CURE S_Dbw
9               | C=9           | 2.3555556727820735E7 | Eps=15 MinPts=4  | 0.042        | C=9                      | 0.2739
8               | C=8           | 0.618                | Eps=15 MinPts=10 | 0.226        | C=8                      | 0.256
7               | C=7           | 1.295                | -                | -            | C=7                      | 0.1765
6               | C=6           | 1800005.464          | Eps=20 MinPts=4  | 0.0365       | C=6                      | 0.1899
5               | C=5           | 1.02165327           | Eps=10 MinPts=10 | 0.0311       | C=5                      | 0.0859
4               | C=4           | 295333334.99         | Eps=40 MinPts=10 | 0.0013       | C=4                      | 0.0013
3               | C=3           | 1.031                | -                | -            | C=3                      | 0.0149
2               | C=2           | 4.197                | -                | -            | C=2                      | 0.672



Table 4. The values of S_Dbw for the 6D-data set clustering schemes

No. of clusters | K-means input | K-means S_Dbw | DBSCAN input    | DBSCAN S_Dbw | CURE input (r=10, a=0.3) | CURE S_Dbw
8               | C=8           | 0.689         | -               | -            | C=8                      | 0.291
7               | C=7           | 0.653         | -               | -            | C=7                      | 0.338
6               | C=6           | 0.662         | -               | -            | C=6                      | 0.322
5               | C=5           | 0.669         | -               | -            | C=5                      | 0.239
4               | C=4           | 0.58          | Eps=5 MinPts=4  | 0.31         | C=4                      | 0.805
3               | C=3           | 0.619         | Eps=25 MinPts=4 | 0.233        | C=3                      | 1.401
2               | C=2           | 0.114         | Eps=35 MinPts=4 | 0.096        | C=2                      | 2.3249



Finally, we evaluate our approach using real data sets. One of the data sets we
studied contains three parts of the Greek road network [17]. The roads are represented
by the vertices of their MBR approximations. Figure 7 is a visualization of this data.
The behaviour of S_Dbw regarding the different clustering schemes (i.e., number of
clusters) defined by the above mentioned algorithms is depicted in Table 5. It is clear
that S_Dbw indicates the correct number of clusters (three) as the best partitioning for
the data set when K-Means or DBSCAN is used.

Table 5. The values of S_Dbw for Real_Data1 (Figure 7)

No. of clusters | K-means input | K-means S_Dbw | DBSCAN input        | DBSCAN S_Dbw | CURE input (r=10, a=0.3) | CURE S_Dbw
8               | C=8           | 0.179         | -                   | -            | C=8                      | 0.1694
7               | C=7           | 0.237         | -                   | -            | C=7                      | 0.1914
6               | C=6           | 0.343         | -                   | -            | C=6                      | 0.2248
5               | C=5           | 0.367         | -                   | -            | C=5                      | 0.1621
4               | C=4           | 0.35          | Eps=50000 MinPts=10 | 0.192        | C=4                      | 0.1349
3               | C=3           | 0.083         | Eps=30000 MinPts=10 | 0.084        | C=3                      | 0.1086
2               | C=2           | 0.918         | Eps=10000 MinPts=10 | 0.891        | C=2                      | 1.0508

Fig. 7. Real_Data1 - A data set representing a part of the Greek road network
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Fig. 8. Real_Data2 - A data set representing towns and villages of Greek islands 





Table 6. The values of S_Dbw for Real_Data2 (Figure 8)

No. of clusters | K-means input | K-means S_Dbw | DBSCAN input       | DBSCAN S_Dbw | CURE input (r=10, a=0.3) | CURE S_Dbw
11              | C=11          | 0.039         | -                  | -            | C=11                     | 0.114
10              | C=10          | 0.0348        | Eps=2000 MinPts=4  | 0.113        | C=10                     | 0.08
9               | C=9           | 0.0503        | Eps=2000 MinPts=10 | 0.072        | C=9                      | 0.049
8               | C=8           | 0.088         | Eps=3000 MinPts=4  | 0.123        | C=8                      | 0.107
7               | C=7           | 0.112         | -                  | -            | C=7                      | 0.121
6               | C=6           | 0.095         | Eps=4000 MinPts=4  | 0.228        | C=6                      | 0.32
5               | C=5           | 0.149         | -                  | -            | C=5                      | 0.266
4               | C=4           | 0.621         | Eps=5000 MinPts=4  | 0.561        | C=4                      | 0.511
3               | C=3           | 0.539         | Eps=6000 MinPts=4  | 0.251        | C=3                      | 0.389
2               | C=2           | 1.150         | Eps=7000 MinPts=4  | 0.621        | C=2                      | 0.617

We carried out a similar experiment using a data set representing the towns and villages 
of a group of Greek islands [17]. A visualization of this data is presented in Figure 8. 
Based on Table 6, we observe that S_Dbw takes its minimum value (0.0348) for the 
clustering scheme of ten clusters as defined by K-Means, which is a “good” approximation 
of the inherent clusters in the underlying data. Both CURE and DBSCAN fail to 
define a good partitioning for the data under consideration. 

It is thus clear that S_Dbw can assist in selecting the clustering algorithm that results 
in the optimal partitioning of the data set under consideration, as well as the input 
parameter values of the selected algorithm for which the optimal partitioning is 
defined. 
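
The selection step described above amounts to taking the minimum of the index over all (algorithm, parameter) combinations that were tried. The following is a minimal sketch of that step only, using the values transcribed from Table 6; the dictionary layout and the function names `best_scheme` and `best_per_algorithm` are illustrative assumptions, and the computation of S_Dbw itself is not shown.

```python
# S_Dbw values transcribed from Table 6 (Real_Data2); a lower value is better.
# Keys are (algorithm, input setting). Table cells marked "-" are left out.
S_DBW = {
    ("K-means", "C=11"): 0.039, ("K-means", "C=10"): 0.0348,
    ("K-means", "C=9"): 0.0503, ("K-means", "C=8"): 0.088,
    ("K-means", "C=7"): 0.112, ("K-means", "C=6"): 0.095,
    ("K-means", "C=5"): 0.149, ("K-means", "C=4"): 0.621,
    ("K-means", "C=3"): 0.539, ("K-means", "C=2"): 1.150,
    ("DBSCAN", "Eps=2000, MinPts=4"): 0.113,
    ("DBSCAN", "Eps=2000, MinPts=10"): 0.072,
    ("DBSCAN", "Eps=3000, MinPts=4"): 0.123,
    ("DBSCAN", "Eps=4000, MinPts=4"): 0.228,
    ("DBSCAN", "Eps=5000, MinPts=4"): 0.561,
    ("DBSCAN", "Eps=6000, MinPts=4"): 0.251,
    ("DBSCAN", "Eps=7000, MinPts=4"): 0.621,
    ("CURE", "C=11"): 0.114, ("CURE", "C=10"): 0.08,
    ("CURE", "C=9"): 0.049, ("CURE", "C=8"): 0.107,
    ("CURE", "C=7"): 0.121, ("CURE", "C=6"): 0.32,
    ("CURE", "C=5"): 0.266, ("CURE", "C=4"): 0.511,
    ("CURE", "C=3"): 0.389, ("CURE", "C=2"): 0.617,
}


def best_scheme(scores):
    """Return the (algorithm, setting) pair with the smallest index value."""
    return min(scores, key=scores.get)


def best_per_algorithm(scores):
    """Return, for each algorithm, its setting with the smallest index value."""
    best = {}
    for (algorithm, setting), value in scores.items():
        if algorithm not in best or value < scores[(algorithm, best[algorithm])]:
            best[algorithm] = setting
    return best


overall = best_scheme(S_DBW)
print(overall, S_DBW[overall])
print(best_per_algorithm(S_DBW))
```

On these values the overall minimum is the K-Means scheme with ten clusters, in agreement with the discussion of Table 6.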



5 Conclusions 

Clustering algorithms may produce different clustering schemes under different 
assumptions. Moreover, there are cases in which a clustering algorithm 
partitions a data set into the correct number of clusters but in the wrong way. In 
most cases, users verify the clustering results visually. For voluminous 
and/or multidimensional data sets, where efficient visualization is difficult or 
even impossible, it becomes hard to determine whether the results of clustering 
are valid. 

In this paper we addressed the important issue of assessing the validity of the 
results of clustering algorithms, so as to select the algorithm and the parameter values for 
which the optimal partitioning of a data set is defined (assuming that the data set 
exhibits a clustering tendency). We proposed a clustering validity approach based on a new 
validity index (S_Dbw) for assessing the results of clustering algorithms. The index is 
optimised for data sets that include compact and well-separated clusters. The 
compactness of the data set is measured by the cluster variance, whereas the separation 
is measured by the density between clusters. 

We have demonstrated the reliability and value of our approach (i) theoretically, by illustrating 
the intuition behind it, and (ii) experimentally, using various data sets of non-standard 
(but in general non-convex) geometries, covering also the multidimensional case. In 
our experiments, the index indicated the algorithm and the respective 
input parameter values that recover the inherent clusters of a data set. 

Further Work. As mentioned earlier, the validity assessment index proposed 
in this paper works better when the clusters are mostly compact. It does not work 
properly for clusters of non-convex (e.g., ring-shaped) or extraordinarily curved 
geometry. We are going to work on this issue, as density and its continuity are no 
longer sufficient criteria. We plan to extend this effort towards 
an integrated algorithm for cluster discovery that puts emphasis on the geometric features 
of clusters, using sets of representative points, or even multidimensional curves, 
rather than a single center point. 

Abstract. Meta-learning for model selection, as reported in the symbolic 
machine learning community, can be described as follows. First, it is 
cast as a purely data-driven predictive task. Second, it typically relies on 
a mapping of dataset characteristics to some measure of generalization 
performance (e.g., error). Third, it tends to ignore the role of algorithm 
parameters by relying mostly on default settings. This paper describes 
a case-based system for model selection which combines knowledge and 
data in selecting a (set of) algorithm(s) to recommend for a given task. 
The knowledge consists mainly of the similarity measures used to retrieve 
records of past learning experiences as well as profiles of learning algo- 
rithms incorporated into the conceptual meta-model. In addition to the 
usual dataset characteristics and error rates, the case base includes ob- 
jects describing the evaluation strategy and the learner parameters used. 
These have two major roles: they ensure valid and meaningful compar- 
isons between independently reported findings, and they facilitate repli- 
cation of past experiments. Finally, the case-based meta-learner can be 
used not only as a predictive tool but also as an exploratory tool for 
gaining further insight into previously tested algorithms and datasets. 



1 Issues and Objectives 

Broadly speaking, the model selection problem concerns the choice of an appro- 
priate model for a given learning task. However, the term has been used with 
varying nuances, or at least shifting emphases, among the different communi- 
ties now involved in data mining. The divergence seems to concern the level of 
generality at which one situates the search for the appropriate model. Among 
statisticians, model selection takes place within a given model family: it typically 
refers to the task of creating a fully specified instance of that family, called the 
fitted model, whose complexity has been fine-tuned to the problem at hand. 
This convention has been carried over to neural network (NN) learning, where 
model complexity is a function of parameters specific to a family of network 
architectures - e.g., the number of hidden layers and units for feedforward NNs 
or the number of centers for Radial Basis Function Networks. In the symbolic 
machine learning (ML) community, the term model selection designates 
the task of selecting a learning algorithm for a specific application; thus search 
is conducted in the space of all known or available model families and the selection 
of a learning algorithm circumscribes a model class or family from which 
the learned model will eventually emerge. However, the task of finding the most 
appropriate model instantiation from the selected family has been relatively ne- 
glected. In other words, ML researchers tend to end where statisticians and NN 
researchers tend to start. Thus, while the latter seldom envisage model selection 
beyond the frontiers of a specific model family (e.g. multilayer perceptrons), the 
former often overlook the role of model parameters when doing cross-algorithm 
comparisons; the Statlog and Metal projects, for instance, have adopted 
the expedient of systematically evaluating learning algorithms with their default 
parameter settings. In the broader perspective of data mining, however, these 
diverse definitions should be seen as partial and complementary perspectives on 
a complex multifaceted task. The first research objective of this work is to in- 
tegrate the choice of learning methods as well as model parameters in a unified 
framework for model selection. 

Given that no learning algorithm can systematically outperform all others, 
the model selection problem arises anew with each learning task. The most 
common approach consists in experimenting with a number of alternative meth- 
ods and models and then selecting one which maximizes certain performance 
measures. However, with the increasing number and diversity of learning meth- 
ods available, exhaustive experimentation is simply out of the question. There 
is a need to limit the initial set of candidate algorithms on the basis of the 
given task or data, and one can legitimately hope that algorithms which have 
proved useful for a certain class of tasks will confirm their utility on other, simi- 
lar tasks. Hence the idea that by examining results of past learning experiences, 
one might determine broad mappings between learning methods and task classes 
so that model selection does not start from scratch each time. Meta-learning is 
an attempt at automating the realization of this idea. 

The meta-learning approach to model selection has been typically based on 
characterizations of the application dataset. In Statlog, meta-learning is a classi- 
fication task which maps dataset attributes to a binary target variable indicating 
the applicability or non-applicability of a learning algorithm. Data characteris- 
tics can be descriptive summary statistics (e.g., number of classes, number of in- 
stances, average skew of predictive variables) or information-theoretic measures 
(e.g., average joint entropy of predictive and target variables). There have since 
been a number of attempts to extend or refine the Statlog approach. In the Metal 
project, for instance, datasets have also been characterized by the error rates 
observed when they are fed into simple and efficient learning algorithms called 
landmarkers (e.g., Naive Bayes or linear discriminants). However, the basic 
idea underlying the Statlog approach remains intact - i.e., that meta-attributes 
describing the application task/dataset suffice to predict the applicability or the 
relative performance of the candidate learning algorithms. 
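
To make the notion of dataset characteristics concrete, the sketch below computes a few meta-attributes of the kind listed above (number of instances, number of classes, average skew of the predictors) for a toy dataset. It is an illustration only: the function names are hypothetical, and class entropy is used as a simpler stand-in for the joint-entropy measures mentioned in the text, not as the Statlog definition.

```python
import math
from collections import Counter


def class_entropy(labels):
    """Shannon entropy (in bits) of the target variable."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def skewness(values):
    """Population skewness m3 / m2**1.5; 0.0 for a constant column."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return 0.0 if m2 == 0 else m3 / m2 ** 1.5


def characterize(columns, labels):
    """Summarize a dataset given as {predictor name: numeric values} plus labels."""
    return {
        "n_instances": len(labels),
        "n_classes": len(set(labels)),
        "mean_skew": sum(skewness(c) for c in columns.values()) / len(columns),
        "class_entropy": class_entropy(labels),
    }


meta = characterize({"x": [1, 2, 3, 4], "y": [0, 0, 0, 4]}, ["a", "a", "b", "b"])
print(meta)
```

A vector of this kind, one per dataset, is what a propositional meta-learner would map to the applicability or performance of each candidate algorithm.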

To broaden the range of meta-level predictors, we propose algorithm profiling 
as a complement to dataset characterization in general and to landmarking in 
particular. Landmarking uses specially selected learners to uncover information 
about the nature of the learning task or data. By contrast, algorithm profiling 
uses specially designed datasets to deliver information about a learning algorithm 






M. Hilario and A. Kalousis 



- its bias/variance profile, scalability, tolerance to noise, irrelevant variables or 
missing data. While landmarking attempts to describe a dataset in terms of 
the areas of expertise to which it belongs (as witnessed by the landmarkers 
which perform well on it), algorithm profiling strives to describe in concrete, 
quantitative terms what makes up the region of expertise of a learning algorithm. 
The second objective is to complement dataset characterizations with algorithm 
profiles as predictors in the meta-learning process. 

There is a fundamental difference between dataset and algorithm character- 
izations behind their apparent symmetry. Dataset characteristics are meta-data 
extracted from individual datasets whereas algorithm profiles embody meta- 
knowledge about learning algorithms which can be brought to bear on model 
selection over different datasets. The essential difference lies in the fact that al- 
gorithm characteristics have been derived via a process of abstraction and/or 
generalization. This generalized knowledge may be borrowed from the collective 
store of expertise in the domain or alternatively abstracted via controlled exper- 
imentation, as in the case of meta-attributes concerning sensitivity to missing or 
irrelevant data. In addition to prior meta-knowledge about learning/modelling 
tools, a domain expert’s background knowledge of her application domain can 
be expressed in the form of constraints that should be taken into account in the 
search for an appropriate tool. The third objective is to strike an effective and ef- 
ficient balance between meta-learning and the use of prior (base- and meta-level) 
knowledge in the model selection process. 

With the introduction of prior knowledge about learning algorithms and 
application domains, meta-learning becomes a multi-relational task which calls 
for greater expressive power than that offered by attribute value vectors. In 
this paper we describe an object-oriented case-based meta-learning assistant 
which addresses the issues and objectives described above. Section 2 describes 
the knowledge embedded in the system’s underlying conceptual (meta) model. 
Section 3 describes the current implementation - the extensive case base gathered 
to date as well as the different ways in which it can be exploited - and proposes 
a strategy for evaluating its incremental meta-learning capabilities. Section 4 
summarizes and argues for its possible utility as a long-term meta-memory of 
machine learning experiments. 



2 The Embedded Knowledge 

This section focuses on the knowledge embedded in the meta-model of the learning 
process. A simplified view of the conceptual schema is shown in Fig. 1. 

2.1 Modelling Processes 

The core of the meta-model is the modelling process, which depicts a specific 
learning episode or experiment. It is described by a number of performance 
measures such as the error rate, the training/testing time, and the size of the 
learned model or hypothesis. More importantly, the ModProcess object is the hub 

Fig. 1. The conceptual metamodel (see Sect. 2 for explanation) 



which links together three main components according to a precise semantics: 
a modelling or learning tool (ModTool) is trained and tested on the given data 
(Dataset) following a particular evaluation strategy (EvalStrat). The structure 
and attributes of these three components comprise the background information 
which will be brought to bear in the meta-learning and model selection process. 



2.2 Datasets 

The object depicting a dataset can be seen as a simple, albeit extended, impor- 
tation of the Statlog dataset characteristics. These will not be described here 
(see H21 for a detailed discussion); rather, we present several major extensions 
which have been made possible by the structured representation adopted. In 
the Statlog formalism, only summary statistics over all predictive variables of 
a dataset could be recorded; the consequence was that atypical variables (e.g., 
irrelevant variables) were impossible to detect since the symptoms of this atyp- 
icality (e.g., an extremely low measure of mutual information with the target 
variable) somehow got dissolved in the overall statistics (e.g., average mutual information). 
Such problems disappear in a multi-relational representation, where 
a dataset can be characterized more thoroughly by the collection of objects that 
describe its component variables. 

In addition, both variables and datasets can be divided into subclasses 
and thus described only by features that are certain to make sense for the spe- 
cific category in question. One persistent problem of meta-learning in Statlog 
and other propositional approaches is that different types of datasets are forced 
into a single attribute vector, with the result that many meta-attributes have 
missing values when they turn out to be not applicable to a certain data type. 
For instance, summary statistics of continuous variables (e.g., mean skewness, 
departure from normality) are not applicable to categorical variables; also, 
information-theoretic measures are often not computed for continuous variables. 
The result is that comparisons between symbolic, numeric, and mixed datasets 
become highly problematic. This difficulty is circumvented quite naturally in 
a typed multi-relational setting. Variables can be divided into subclasses along 
two dimensions. According to their role in the dataset, they are either predictors 
or targets; according to the data type, they are either continuous or discrete. 
Similarly, datasets may be symbolic, numeric, or mixed, depending on the types 
of their component variables. Along a different dimension, datasets are either 
labelled (e.g., for classification) or unlabelled (e.g., for association). Labelled 
datasets contain a number of attributes such as average joint entropy or average 
mutual information which make sense only for supervised learning tasks. 
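
The typed, multi-relational representation described in this section can be sketched with a small class hierarchy. This is an assumed illustration, not the system's actual schema: all class, field, and method names are hypothetical, and the threshold for flagging a variable is arbitrary. It shows how per-variable objects keep an atypical variable visible where an average over all predictors would hide it.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Variable:
    name: str
    role: str  # "predictor" or "target"
    mutual_info: Optional[float] = None  # with the target; None when not computed


@dataclass
class ContinuousVariable(Variable):
    skewness: Optional[float] = None  # only meaningful for continuous data


@dataclass
class DiscreteVariable(Variable):
    n_values: Optional[int] = None  # only meaningful for discrete data


@dataclass
class Dataset:
    name: str
    variables: List[Variable] = field(default_factory=list)

    def predictors(self):
        return [v for v in self.variables if v.role == "predictor"]

    def kind(self):
        """Return 'numeric', 'symbolic' or 'mixed' from the predictor types."""
        types = {type(v) for v in self.predictors()}
        if types == {ContinuousVariable}:
            return "numeric"
        if types == {DiscreteVariable}:
            return "symbolic"
        return "mixed"

    def suspect_irrelevant(self, threshold=0.01):
        """Names of predictors whose own mutual information is below threshold."""
        return [v.name for v in self.predictors()
                if v.mutual_info is not None and v.mutual_info < threshold]


toy = Dataset("toy", [
    ContinuousVariable("age", "predictor", mutual_info=0.40, skewness=0.3),
    DiscreteVariable("colour", "predictor", mutual_info=0.35, n_values=4),
    DiscreteVariable("row_id", "predictor", mutual_info=0.001, n_values=100),
    DiscreteVariable("label", "target", n_values=2),
])
print(toy.kind(), toy.suspect_irrelevant())
```

In this toy case the average mutual information of the three predictors is about 0.25, which gives no hint that one of them carries almost no information about the target; the per-variable objects do.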

2.3 Modelling Tools 

The ModTool class subsumes any fully implemented modelling or learning 
method which can be used for supervised or unsupervised knowledge discovery 
from data. Each tool is formalized as a ModTool subclass. The tools used 
in our initial study are C5.0 in its tree (c50tree) and rule (c50rules) versions, an 
oblique decision tree (ltree), a sequential covering rule inducer (ripper), 
the MLC++ implementations of Naive Bayes (mlcnb) and IB1 (mlcib1) [7], 
Joao Gama’s implementation of Fisher’s linear discriminant (lindiscr), and the 
Clementine implementations of radial basis function networks (clemRBFN) and 
backpropagation in multilayer perceptrons (clemMLP).¹ All learning tools inherit 
the meta-features defined for the ModTool class; in addition, each subclass has 
its specific attributes corresponding to the tool’s user-tunable model and search 
parameters - e.g., pruning severity for (c50tree), the number of centers and their 
overlap for (clemRBFN). Each application of a learning tool is recorded as an 
instance of the corresponding subclass, and the actual parameters override the 
default values predefined in the conceptual schema. 
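
The relation between the ModTool class, its tool-specific subclasses, and recorded applications can be sketched as follows. This is a minimal illustration under stated assumptions: the attribute names are hypothetical, and the default values for pruning severity and overlap are placeholders; only the default of 20 centers for clemRBFN and the cost-handling flag of c50tree are taken from the text and Table 1.

```python
from dataclasses import dataclass


@dataclass
class ModTool:
    """Meta-features shared by all tools (a small illustrative subset)."""
    incremental: bool = False
    cost_handling: bool = False


@dataclass
class C50Tree(ModTool):
    cost_handling: bool = True   # c50tree handles costs (Table 1, column CH)
    pruning_severity: int = 25   # placeholder default, assumed for illustration


@dataclass
class ClemRBFN(ModTool):
    n_centers: int = 20          # default number of centers stated in the text
    overlap: float = 1.0         # placeholder default, assumed for illustration


# Each application of a tool is an instance; actual parameters override defaults.
default_run = ClemRBFN()
tuned_run = ClemRBFN(n_centers=50)
print(default_run.n_centers, tuned_run.n_centers, tuned_run.incremental)
```

Because the overrides live on the instance, the class-level defaults remain available for comparing tuned runs against the default configuration.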

There are no intermediate subhierarchies of modelling tools according to 
learning paradigm or computational approach. This design option has been taken 
deliberately: since the aim of our meta-learner is to discover mappings of data 
and task types onto classes of learners, we have avoided a priori classifications 
that may hinder the discovery of novel or unexpected affinities or clusters among 
learning methods. On the other hand, we have tried to embed as much knowledge 
as possible about the biases, strengths and weaknesses of each learning tool. 

Representation and Approach. The simplest form of knowledge concerns 
the basic requirements, capabilities and limitations of each tool, which have 

¹ The parenthesized names will be used throughout the rest of this paper to identify 
the specific implementations of the learning algorithms studied. 



Fusion of Meta-knowledge and Meta-data for Case-Based Model Selection 






Table 1. Characterizing representation and approach of modeling tools 

ModTool   Data  Inc  CH  VH   Par   Meth    Strat  Cum
c50rules  NS    N    Y   Seq  Sym   Logic   E      L
c50tree   NS    N    Y   Seq  Sym   Logic   E      L
clemMLP   NS    N    N   Par  NN    Thresh  E      M
clemRBFN  NS    N    N   Par  NN    Comp    E      M
lindiscr  NS    N    N   Par  Stat  Thresh  E      H
ltree     NS    N    N   SP   Sym   Logic   E      M
mlcib1    NS    Y    N   Par  Sym   Comp    L      M
mlcnb     NS    N    N   Par  Sym   Comp    E      H
ripper    NS    N    N   Seq  Sym   Logic   E      L


generally been gathered from algorithm specifications or instruction manuals of 
the implemented tool. Attributes in this group indicate the type of data (Data 
in Table 1) supported by a learning tool (N for numeric/continuous, S for symbolic/discrete, 
NS for both), whether the tool learns incrementally or not (Inc), 
or whether it can handle externally assigned costs (CH). These characteristics 
can be determined in a straightforward manner and they are usually invariant 
for all instantiations of a given learning tool. 

As tool specifications provide only a minimal characterization of learner functionality, 
knowledge of less obvious features (see last five columns of Table 1) has 
been gathered mainly from cumulative results of past research. We borrowed the 
paradigm-based categorization of learning algorithms (Par) as symbolic, statis- 
tical, or connectionist as well as Langley’s distinction between logical, competi- 
tive, and threshold-based learning approaches (Meth). From the point of view of 
learning strategy (Strat), modelling tools are either lazy or eager, depending on 
whether they simply store data, deferring learning until task execution time, or 
use given data to create a model in view of future task requests. Another dimen- 
sion is the way a learner handles input variables in the generalization process 
(VH). Sequential algorithms (e.g., decision trees) examine one input variable at 
a time, while parallel algorithms examine all input variables simultaneously. 
Neural networks are clearly parallel; so are instance-based learners and Naive 
Bayes classifiers, which aggregate distances or probabilities over all variable val- 
ues simultaneously. A third, hybrid category includes algorithms such as oblique 
decision trees which alternate between sequential and simultaneous processing 
of variables depending on the data subset examined at a node. 

An additional aspect of learning bias which has been brought to light by 
recent research is what Blockeel [2] calls cumulativity (Cum). This is a generalization 
of the statistical concept of additivity: two features are cumulative if 
their effects are mutually independent, so that their combined effect is the trivial 
composition of their separate effects. The cumulativity of learning algorithms is 
nothing more than their ability to handle cumulativity of features, which can 
be discretized roughly on a three-step scale. Linear regressors and discriminants 
are naturally situated on the high end of the scale; so is Naive Bayes with its 
assumption of the class-conditional independence of predictors (the product of 

Table 2. Characterizing resilience of modeling tools 

ModTool   Var     ErrSIrr  TimeSIrr  MCAR    MAR
c50rules  0.4503  0.0324   4.76      0.2475  0.2527
c50tree   0.4548  0.0337   4.26      0.2098  0.2094
clemMLP   0.4230  0.1292   378.25    ?       ?
clemRBFN  0.3626  0.0910   9595.93   ?       ?
lindiscr  0.2308  0.0351   4.92      0.1913  0.1624
ltree     0.4154  0.0413   14.57     0.1111  0.1251
mlcib1    0.3868  0.1347   56.98     0.2283  0.2292
mlcnb     0.2273  0        3.64      0.1073  0.0940
ripper    0.3862  0.0173   98.67     0.1310  0.1444



likelihoods and priors it maximizes translates directly to additivity of their respective 
logarithms). At the other extreme, sequential learners like decision trees 
and rule induction systems allow for maximal interaction between variables and 
therefore have low cumulativity. Instance-based learners and neural networks 
occupy the midpoint of the scale. Neural nets handle cumulativity by means of 
linear combinations while handling interaction by superposing multiple layers 
and using nonlinear threshold functions. 
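
The remark on Naive Bayes can be written out explicitly. The notation below is ours and is assumed for illustration: c denotes a class, x_1, ..., x_d the values of the d predictors. Since the logarithm is monotone, maximizing the product of prior and class-conditional likelihoods is the same as maximizing a sum in which each predictor contributes an independent additive term:

```latex
\hat{c}
  = \arg\max_{c} \; P(c) \prod_{i=1}^{d} P(x_i \mid c)
  = \arg\max_{c} \; \Bigl[ \log P(c) + \sum_{i=1}^{d} \log P(x_i \mid c) \Bigr]
```

Each term \log P(x_i \mid c) depends on one predictor only, which is the sense in which the effects of the predictors combine cumulatively.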

Resilience. A second group of characteristics concerns the resilience of a mod- 
elling tool, i.e., its capability of ensuring reliable performance despite variations 
in training conditions and especially in the training data. Resiliency characteris- 
tics reflect the sensitivity or tolerance of an algorithm to data characteristics or 
pathologies that are liable to affect performance adversely. Examples are stabil- 
ity, scalability, and resistance to noise, missing values, and irrelevant or redun- 
dant features. Contrary to representational and methodological meta-attributes, 
there is no consensus regarding the resilience of the ten learning tools included 
in our initial knowledge base. We thus undertook extensive experimental studies 
concerning their bias/variance trade-off and their sensitivity to missing values and 
irrelevant features. 

Table 2 shows the results of these experiments. The column labelled Var gives 
the proportion of variance in the generalization error. This was obtained by ap- 
plying Kohavi and Wolpert’s bias-variance decomposition method for zero-one 
loss |B| to each tool, averaged over 40 datasets from the UCI Machine Learn- 
ing Repository. Note that the variance measures given here concern each tool as 
applied with its default parameter settings. For instance, 0.3626 is the mean variance 
observed for Clementine RBF networks with the default number of hidden 
units, i.e., 20; variances have also been recorded as the number of hidden units 
is varied from 5 to 150. The next two columns quantify a learner’s sensitivity 
to irrelevant attributes as measured by its mean increase in generalization error 
(ErrSIrr) or in training time (TimeSIrr) for each additional percent of irrelevant 
attributes. These measures were obtained in a series of 10-fold cross-validation 
experiments on 43 UCI datasets; each learning algorithm was run with default 
parameters on the original datasets, then on 6 corrupted versions containing 
respectively 5%, 10%, 20%, 30%, 40%, and 50% irrelevant features. A full discussion 
of these experiments and the results is given in . The last two columns 
depict mean increase in generalization error with each per cent increase in miss- 
ing values - either values that are missing completely at random (MCAR) or 
missing at random (MAR). A value is said to be missing completely at random if 
its absence is completely independent of the dataset; it is missing at random if its 
absence depends, not on its own value, but on the value of some other attribute 
in the dataset. Here again, we followed a strategy of “progressive corruption” 
to observe how learners cope with incomplete data. Generalization error 
was estimated for each learner on the original datasets, then on five increasingly 
incomplete versions from which 5%, 10%, 20%, 30%, and 40% of the feature 
values were deleted. For the MCAR series, feature values were deleted randomly 
following a uniform distribution; for the MAR series, values of selected features 
were deleted conditional on the values of other attributes. The interested reader 
is referred to p] for details. 
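
The two deletion schemes can be sketched as follows. This is an assumed, simplified reading of the procedure described above, not the authors' scripts: the function names, the representation of a dataset as a list of rows, and the use of None for a missing value are all illustrative choices, and the MAR variant conditions the deletion on a single other attribute through a caller-supplied predicate.

```python
import random


def delete_mcar(rows, fraction, rng):
    """Blank out `fraction` of all feature cells, chosen uniformly at random."""
    cells = [(i, j) for i, row in enumerate(rows) for j in range(len(row))]
    chosen = rng.sample(cells, round(fraction * len(cells)))
    out = [list(row) for row in rows]
    for i, j in chosen:
        out[i][j] = None
    return out


def delete_mar(rows, target_col, cond_col, predicate, fraction, rng):
    """Blank `target_col` in a fraction of the rows whose `cond_col` value
    satisfies `predicate`; absence thus depends on another attribute only."""
    eligible = [i for i, row in enumerate(rows) if predicate(row[cond_col])]
    chosen = set(rng.sample(eligible, round(fraction * len(eligible))))
    return [
        [None if (i in chosen and j == target_col) else v for j, v in enumerate(row)]
        for i, row in enumerate(rows)
    ]


rng = random.Random(0)
data = [[i, i * 2, i * 3, i * 4] for i in range(10)]
mcar = delete_mcar(data, 0.20, rng)
mar = delete_mar(data, target_col=1, cond_col=0,
                 predicate=lambda v: v > 4, fraction=0.40, rng=rng)
print(sum(v is None for row in mcar for v in row),
      sum(row[1] is None for row in mar))
```

Running a learner on the original data and on each increasingly corrupted version, then relating the error to the percentage deleted, yields sensitivity figures of the kind reported in the MCAR and MAR columns.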

It should be stressed that all characteristics as well as any conclusions drawn 
about a modelling tool concern the specific software implementation under study 
rather than the generic learning algorithm or method. It is well known that the 
specification of an algorithm leaves considerable flexibility in implementation, 
and differences between implementations of the same algorithm can turn out to 
be as significant as differences between distinct methods or algorithms. Thus, 
while we have taken pains to include a wide variety of learning approaches in 
our study, the findings reported should not be extrapolated from the individual 
tool implementation to the generic method without utmost precaution. 

Practicality. Finally, other features concern more practical issues of usability. 
They do not impact a learner’s generalization performance but can be used to 
pre-select tools on the basis of user preferences. Examples of such characteris- 
tics are the comprehensibility of the method, the interpretability of the learned 
model, or the degree to which model and search parameters are handled auto- 
matically. Since values of these meta-characteristics are qualitative and highly 
subjective, we assigned 5-level ordinal values on what we deemed to be intu- 
itively obvious grounds. For instance, parameter handling is rated very high for 
lindiscr, ltree, mlcib1, and mlcnb - tools which require absolutely no user-tuned 
parameter. When an algorithm involves user-tunable parameters, its rating depends 
on how well the algorithm performs without user intervention, that is, when 
run with its default parameter settings. Thus c50tree and c50rules are marked 
high, clemMLP medium, and clemRBFN low (the default of 20 centers often leads 
to poor performance and even to downright failure to converge). As for method 
comprehensibility and result interpretability, we relied heavily on the Statlog 
characterization, since these are among the few characteristics that are intrinsic 
to the method and vary little across implementations. For instance, the neural 
network tools rate very low on both comprehensibility and interpretability; de- 
cision trees and rules rate high on both counts whereas for mlcibl, the method 
(learning by similarity) is easier to grasp than the ’model’ (distance measures of 
nearest neighbors). 

2.4 Evaluation Strategies 

To ensure that all recorded learning episodes conform to the elementary rules of 
tool evaluation, each modelling process is associated with a fully specified evalua- 
tion strategy. Examples of generic strategies are simple holdout, cross-validation, 
bootstrap, and subsampling without replacement. Each has its own particular 
set of parameters: the proportion of the training set for holdout, the number of 
folds for cross-validation, or the number of replicates for bootstrap. Attributes 
common to all strategies are the number of trials (complete runs of the selected 
strategy) and the random seed used. Such detailed accounts have a two-fold mo- 
tivation. First, we all know that the same method applied to the same dataset 
can lead to widely different performance measures, depending on whether these 
were estimated using simple holdout or leave-one-out cross-validation. Many 
of the cross-experimental comparisons reported in the literature deserve little 
credence for lack of evidence that performance measures were obtained under 
identical or at least comparable learning conditions. Secondly, information about 
the evaluation strategy followed in an experiment should be sufficiently precise 
and complete to allow for replication and take-up by other researchers. 
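
A fully specified evaluation strategy can be represented as a small typed record. The sketch below is an assumption-laden illustration: the class and field names, the default values, and in particular the `comparable` criterion (same strategy type and same parameters, ignoring the random seed) are ours and are not taken from the system described here.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class EvalStrat:
    n_trials: int = 1      # complete runs of the selected strategy
    random_seed: int = 0


@dataclass(frozen=True)
class Holdout(EvalStrat):
    train_proportion: float = 0.7


@dataclass(frozen=True)
class CrossValidation(EvalStrat):
    n_folds: int = 10


@dataclass(frozen=True)
class Bootstrap(EvalStrat):
    n_replicates: int = 100


def comparable(a, b):
    """Assumed criterion: same strategy type and parameters; seeds may differ."""
    return type(a) is type(b) and replace(a, random_seed=0) == replace(b, random_seed=0)


cv_a = CrossValidation(n_folds=10, random_seed=1)
cv_b = CrossValidation(n_folds=10, random_seed=2)
print(comparable(cv_a, cv_b), comparable(cv_a, Holdout()))
```

Storing the seed and the number of trials alongside the strategy parameters is what allows a recorded experiment to be rerun exactly.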

3 The Implementation 

3.1 The Case Base 

The conceptual schema described in the preceding section has been implemented 
using CBR-Works Professional. Given the sheer volume of the collected meta-data, 
interactive data entry was out of the question. A set of scripts was implemented 
to automate the translation of data characterization files as well as 
results of learning experiments into CBR-Works’ Case Query Language. The 
current case base contains objects representing: 

- more than 1350 datasets for classification tasks (98 UCI and other benchmarks 
  plus semi-artificial datasets generated from these for irrelevance and 
  missing values experiments) 

- around 37500 variables belonging to the above datasets 

- around 11700 experiments involving the training and evaluation of 9 classification 
  algorithms on the above datasets. 



3.2 Exploitation Scenarios 

The basic scenario follows the standard CBR cycle consisting of the four R’s: 
retrieve, reuse, revise, retain. A query case is entered in the form of a ModProcess 
object whose minimal content is a set of links to three objects representing 
a dataset, a modelling tool, and an evaluation strategy. While the last two can 
be left completely unspecified, the application dataset should be fully charac- 
terized in the query case’s Dataset object, itself linked to objects describing its 
component variables. Optionally, users can fill out slots of all the other objects 






of the query case in order to specify a set of constraints based on the nature 
of the application task or their own preferences. For instance, they can impose 
preferences on the incrementality, comprehensibility, or stability of a modelling 
tool by filling out the relevant slot of the ModTool object. In such cases, learner 
characteristics serve in prior model selection by restricting from the outset the 
space of tools to consider. Users can also use the ModProcess object to specify 
what they consider acceptable performance (e.g., a lower bound on the accuracy 
gain ratio or an upper bound on the error rate or the time to be spent in training 
or testing). The system retrieves the k most similar cases (where k is a user-specified 
parameter), each of which can be taken as a combined recommendation for a 
modelling tool, its parameter settings, and an evaluation strategy. The user runs 
the recommended algorithms and the results are integrated into the case mem- 
ory as k new cases and their associated objects. This basic scenario illustrates 
the use of standard CBR to incrementally improve model selection by learning 
from experience. 

For case retrieval to work properly, additional domain-specific knowledge 
has been embedded in the similarity measures. While standard symmetric cri- 
teria work well with boolean or categorical meta-attributes such as learner 
incrementality, method, or cost handling ability, ordered features usually call 
for asymmetrical similarity criteria. Ordinal (including real) values specified in 
a query case are often meant as lower or upper bounds on the corresponding 
attribute. For instance, a user who requires medium interpretability of learned 
models would be even happier with high or very high interpretability. Similarly, 
an error specification of 0.2 in the query case should be taken to mean the high- 
est acceptable error, with errors <0.2 getting proportionally higher similarity 
scores as they approach 0. On the contrary, an accuracy gain ratio of 0.1 should 
be interpreted as a minimum, with higher values becoming progressively more 
“similar” as they approach 1.
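
The asymmetric criteria described above can be illustrated with a minimal sketch. The function names, the 0.5 score assigned to a case that exactly meets the bound, the linear interpolation, and the five-level interpretability scale are all assumptions made for this example; the text does not specify the scoring functions actually used in the system.

```python
from typing import Sequence


def upper_bound_similarity(query: float, case: float) -> float:
    """Query value is the highest acceptable value (e.g. an error rate).

    Assumed scale: 0.0 if the bound is violated, 0.5 at the bound,
    rising linearly to 1.0 as the case value approaches 0.
    """
    if case > query:
        return 0.0
    if query == 0:
        return 1.0  # only a case value of 0 can satisfy a zero bound
    return 1.0 - 0.5 * case / query


def lower_bound_similarity(query: float, case: float, top: float = 1.0) -> float:
    """Query value is the lowest acceptable value (e.g. accuracy gain ratio).

    Assumed scale: 0.0 below the bound, 0.5 at the bound,
    rising linearly to 1.0 as the case value approaches `top`.
    """
    if case < query:
        return 0.0
    if query >= top:
        return 1.0
    return 0.5 + 0.5 * (min(case, top) - query) / (top - query)


# Hypothetical ordinal scale for a meta-attribute such as interpretability.
LEVELS = ("very low", "low", "medium", "high", "very high")


def ordinal_similarity(query: str, case: str,
                       levels: Sequence[str] = LEVELS) -> float:
    """Treat an ordinal query value as a lower bound on the attribute."""
    scale = len(levels) - 1
    return lower_bound_similarity(levels.index(query) / scale,
                                  levels.index(case) / scale)
```

With these definitions, a case with error 0.1 is more similar to a query specifying 0.2 than a case with error 0.2, while a case with error 0.3 is rejected outright; a symmetric distance would instead rank 0.3 and 0.1 equally.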

The reverse exploitation scenario illustrates the use of the system as an ex- 
ploratory workbench which the user (who may happen to be a KDD researcher) 
can use to gain insights about learning tools of interest. In problem-solving cum 
learning mode, the goal is: Given this dataset, which modelling tool(s) should I 
use to get best results? In exploratory mode, the goal can be stated thus: Given 
this modelling tool, for which class(es) of tasks/data is it most appropriate? To 
chart the region of expertise of a learning algorithm, the user enters a query 
case consisting mainly of the learning algorithm and a bound on some perfor- 
mance measure (e.g., an error rate <learning algorithm’s region of competence 
as defined by the performance criteria used. 

3.3 Validating the System 

We have described an initial implementation of a case-based assistant which 
recommends modelling tools for a given learning task. It provides decision sup- 
port by incorporating meta-knowledge of the model selection problem into its 
basic learning mechanism. The goal is to develop a workbench for incremental 
meta-learning which is on the agenda for the third year of the Metal project. 

To validate the system we need to set a baseline, i.e., measure the performance
of the system with its initial knowledge and case base, and then evaluate
its ability to improve performance with experience. We propose the following
experimental setup: First, divide the existing meta-dataset into 3 roughly equal
subsets (around 4500 learning episodes each) and prime the learner with subset
1. Second, use subset 2 to “grow” the system. Enter each case without the
target performance measure (e.g., accuracy gain ratio) as a query, then compare
system recommendations with known results and add the query case to the
growing base. After addition of a fixed number of cases (e.g., 200), test the case
base on subset 3 in view of plotting performance variation with experience.



4 Summary 

We presented a case-based assistant for model selection which combines three 
major features. First, it combines knowledge of learning algorithms with dataset 
characteristics in order to facilitate model selection by focusing on the most 
promising tools on the basis of user specified constraints and preferences. Second, 
it incorporates a mechanism for distinguishing between different parameteriza- 
tions of learning tools, thus extending model selection to cover both the choice 
of the learning algorithm and its specific parameter settings. Third, it integrates 
meta-data gathered from different learning experiences with meta-knowledge not 
only of learning algorithms but also of modelling processes, performance metrics 
and evaluation strategies. We are aware of no other system that has all these 
features simultaneously. As pointed out in the introduction, mainstream meta- 
learning for model selection has focused mainly on dataset characteristics as 
predictors of the appropriateness of candidate learning algorithms. Todorowski 
m has tried to go beyond summary dataset statistics and examine character- 
istics of the individual variables in the data in order to learn first-order model 
selection rules. However, he does not incorporate knowledge of learning tools or
their parameters. 

We believe that the proposed system is not just useful for meta-learning but 
can also evolve into some kind of meta-memory of machine learning research. 
There is a need for a system that manages and maintains meta-knowledge in- 
duced from experimentation together with information about the experiments 
themselves. First, such a long-term memory would allow reliable cross-experi- 
mental comparisons. It is common practice among machine learning researchers 
to compare new observations on tool performance and efficiency with past find- 
ings; however, in the absence of clear indications concerning experimental condi- 
tions (parameter settings of the learning tools, evaluation strategies used, train- 
ing and test sample sizes, statistical significance of results, etc.), there is no 
guarantee that the measures being compared are indeed comparable. Second, 
it would avoid redundant effort, as researchers pursuing an idea or hypothe- 
sis could first consult the store of accumulated knowledge before designing new 
experiments. 

Abstract. Recently, association rule mining has been generalized to the
discovery of episodes in event sequences. In this paper, we additionally 
take durations into account and thus present a generalization to time 
intervals. We discover frequent temporal patterns in a single series of 
such labeled intervals, which we call a state sequence. A temporal pattern 
is defined as a set of states together with their interval relationships 
described in terms of Allen’s interval logic, for instance “A before B, 
A overlaps C, C overlaps B” or equivalently “state A ends before state 
B starts, the gap is covered by state C”. As an example we consider
the problem of deriving local weather forecasting rules that allow us to 
conclude from the qualitative behaviour of the air-pressure curve to the 
wind-strength. Here, the states have been extracted automatically from 
(multivariate) time series and characterize the trend of the time series 
locally within the assigned time interval. 



1 Introduction 

To predict or forecast a system’s behaviour in the near future it is probably best
to develop a global model of the system and to estimate its parameters with the
help of observations in the past. But the identification of such a model requires
substantial knowledge about the whole system, which is absent in typical knowledge
discovery applications. Nevertheless, we often expect certain relationships
between measured variables and the system’s behaviour in the future; maybe we
already have some snapshots of typical behaviour in mind, but we are far away
from being able to model the system as a whole. Such typical key situations are
often associated with a typical qualitative behaviour of measured variables, and
consequently humans often control technical systems by simple visual inspection
of displayed trends [4]. Examples of rules using qualitative descriptions of
time-varying data can be found in the domain of medical diagnosis, material science
[6], diagnostics and supervision [11] or qualitative reasoning [12], to mention
only a few. In this paper, we consider the problem of deriving such local rules
inductively by observing the variables for a long period of time.

Why qualitative descriptions at all? The problem of finding common characteristics
of multiple time series or different parts of the same series requires a
notion of similarity. If a process is subject to variation in time (translation or
dilation), those measures used traditionally for estimating similarity (e.g. the
pointwise Euclidean norm) will fail to provide useful hints about the similarity of
time series as perceived by a human. This problem has
been addressed by many authors in the literature, e.g. [1, 5]. Here we use qualitative
descriptions to divide the time series into small segments, each of which is
easy for a human to grasp and understand. Matching of time series is then
performed on the basis of these labeled segments rather than on the raw time
series. The basic descriptions can be defined a priori (for example “slightly increasing
segment”) [4, 14, 6], can be learned from a set of examples (labeled
training set), or can be found automatically by means of clustering short subsequences
[7]. Finally, we arrive at a sequence of labeled intervals: time intervals
in which a certain condition holds in the original time series.

This paper considers the problem of discovering temporal relationships be- 
tween primitive patterns in time series in a fairly general manner: The time series 
is turned into a sequence of labeled intervals in Sect. 2. A temporal pattern will 
be defined as a number of states (the primitive patterns) and their temporal 
relationship in terms of Allen’s temporal logic [3] in Sect. 3. After discussing 
how to count patterns in an interval sequence in Sect. 4, we search for frequent
patterns in Sect. 5 in a fashion that is similar to the discovery of association rules 
[2], which has been extended to event sequences in [13, 15]. Given the frequent 
patterns, rules about temporal relationships can be derived. As an application 
of this algorithm, we consider the problem of finding rules about the qualitative 
behaviour of multivariate time series in Sect. 6. 



2 State Sequences 

Let $S$ denote the set of all possible trends, properties, or states that we want
to distinguish, for example “pressure goes down” or “water level is constant”.
A state $s \in S$ holds during a period of time $[b, f)$ where $b$ and $f$ denote the
initial point in time when we enter the state and the final point in time when
the state no longer holds. A state sequence on $S$ is a series of triples defining
state intervals

$(b_1, s_1, f_1),\ (b_2, s_2, f_2),\ (b_3, s_3, f_3),\ (b_4, s_4, f_4),\ \ldots$

where $b_i \le b_{i+1}$ and $b_i < f_i$ holds. We do not require that one state interval
has ended before another state interval starts. This enables us to merge several
state sequences (possibly obtained from different sources) into a single state
sequence.

However, we do require that every state interval $(b_i, s, f_i)$ is maximal in the sense
that there is no $(b_j, s, f_j)$ in the series such that $[b_i, f_i)$ and $[b_j, f_j)$ overlap or
meet each other:

$\forall (b_i, s_i, f_i), (b_j, s_j, f_j),\ i < j : \quad f_i \ge b_j \Rightarrow s_i \ne s_j \qquad (1)$

If (1) is violated, we can merge both state intervals and replace them by their
union $(\min(b_i, b_j), s, \max(f_i, f_j))$.
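
The merging step that restores condition (1) can be sketched as follows. This is only an illustration under our own assumptions: state intervals are represented as `(b, s, f)` tuples, and the function name `make_maximal` is hypothetical.

```python
from typing import Dict, List, Tuple

StateInterval = Tuple[float, str, float]  # (initial time b, state s, final time f)


def make_maximal(intervals: List[StateInterval]) -> List[StateInterval]:
    """Merge same-state intervals that overlap or meet, so that (1) holds."""
    merged: List[StateInterval] = []
    last_index: Dict[str, int] = {}  # state -> position of its latest interval
    for b, s, f in sorted(intervals):  # process in order of initial time
        idx = last_index.get(s)
        if idx is not None and merged[idx][2] >= b:
            # Same state and f_i >= b_j: replace both by their union.
            prev_b, _, prev_f = merged[idx]
            merged[idx] = (prev_b, s, max(prev_f, f))
        else:
            last_index[s] = len(merged)
            merged.append((b, s, f))
    return merged
```

Because the input is processed in order of initial time, the initial time of a merged interval is already the minimum and only the final time has to be extended.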



3 Temporal Patterns 



We use Allen’s temporal interval logic [3] to describe the relation between state 
intervals. For any pair of intervals we have 13 possible relationships; they are 
illustrated in Fig. 1. For example, we say “A meets B” if interval A terminates 
at the same point in time at which B starts. The inverse relationship is “B
is-met-by A”. In the following we denote the set of interval relations as shown in
the figure by I. 



A after B               B before A
A is-met-by B           B meets A
A is-overlapped-by B    B overlaps A
A finishes B            B is-finished-by A
A during B              B contains A
A is-started-by B       B starts A
A equals B              B equals A

Fig. 1. Allen’s interval relationships.

Given $n$ state intervals $(b_i, s_i, f_i)$, $1 \le i \le n$, we can capture their relative
positions to each other by an $n \times n$ matrix $R$ whose elements $R[i,j]$ describe the
relationship between state interval $i$ and $j$. As an example, let us consider the
state sequence in Fig. 2. Obviously state A is always followed by B, and the lag
between A and B is covered by state C. Below the state interval sequence both of
these patterns are written as a matrix of interval relations. Formally, a temporal
pattern of size $n$ is defined by a pair $(s, R)$, where $s : \{1, .., n\} \to S$ maps index
$i$ to the corresponding state, and $R \in I^{n \times n}$ denotes the relationship between
$[b_i, f_i)$ and $[b_j, f_j)$¹. By $\dim(P)$ we denote the dimension (number $n$ of intervals)
of the pattern $P$. If $\dim(P) = k$, we say that $P$ is a $k$-pattern. Of course, many
sets of state intervals map to the same temporal pattern. We say that the set
of intervals $\{(b_i, s_i, f_i) \mid 1 \le i \le n\}$ is an instance of its temporal pattern $(s, R)$.
We define the space $TP(S)$ of temporal patterns over $S$ informally as the space
of all valid temporal patterns of arbitrary dimension².

¹ To determine the interval relationships we assume closed intervals $[b_i, f_i]$.

² Conditions for a valid temporal pattern are, for instance, that $R[i,j]$ is always the
inverse of $R[j,i]$.

state interval sequence: [plot over time in which each occurrence of state A is
followed by state B and the gap between them is covered by state C]

temporal relations:

        A   B                A   B   C
    A   =   b            A   =   b   o
    B   a   =            B   a   =   io
                         C   io  o   =

(abbreviations: a=after, b=before, o=overlaps, io=is-overlapped-by)

Fig. 2. Example for state interval patterns expressed as temporal relationships.
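
To make the construction of the relation matrix concrete, the following sketch determines Allen's relation for a pair of intervals and assembles the matrix $R$ for a list of intervals. The representation of intervals as `(b, f)` pairs with `b < f` and the function names are assumptions of this example, not part of the original algorithm description.

```python
from typing import List, Tuple

Interval = Tuple[float, float]  # (initial time b, final time f), with b < f


def allen_relation(a: Interval, b: Interval) -> str:
    """Return the Allen relation that holds between interval a and interval b."""
    (ab, af), (bb, bf) = a, b
    if af < bb:
        return "before"
    if bf < ab:
        return "after"
    if af == bb:
        return "meets"
    if bf == ab:
        return "is-met-by"
    if ab == bb and af == bf:
        return "equals"
    if ab == bb:
        return "starts" if af < bf else "is-started-by"
    if af == bf:
        return "finishes" if ab > bb else "is-finished-by"
    if bb < ab and af < bf:
        return "during"
    if ab < bb and bf < af:
        return "contains"
    return "overlaps" if ab < bb else "is-overlapped-by"


def relation_matrix(intervals: List[Interval]) -> List[List[str]]:
    """R[i][j] is the relation between interval i and interval j."""
    return [[allen_relation(x, y) for y in intervals] for x in intervals]


# Hypothetical instance of the 3-pattern of Fig. 2: A before B, gap covered by C.
A, B, C = (0, 2), (3, 5), (1, 4)
R = relation_matrix([A, B, C])
```

For these three intervals the rows of `R` read (equals, before, overlaps), (after, equals, is-overlapped-by) and (is-overlapped-by, overlaps, equals), which is the second matrix of Fig. 2 written with full relation names.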

Next, we define a partial order $\sqsubseteq$ on temporal patterns. We say that temporal
pattern $(s_A, R_A)$ is a subpattern of $(s_B, R_B)$ (or $(s_A, R_A) \sqsubseteq (s_B, R_B)$),
if $\dim(s_A, R_A) \le \dim(s_B, R_B)$ and there is an injective mapping
$\pi : \{1, .., \dim(s_A, R_A)\} \to \{1, .., \dim(s_B, R_B)\}$ such that

$\forall i, j \in \{1, .., \dim(s_A, R_A)\} : \quad R_A[i,j] = R_B[\pi(i), \pi(j)]$

The relation $\sqsubseteq$ is reflexive and transitive, but not antisymmetric: we can
have $(s_A, R_A) \sqsubseteq (s_B, R_B)$ and $(s_B, R_B) \sqsubseteq (s_A, R_A)$ without $s_A = s_B$ and
$R_A = R_B$ due to a different state ordering. But permuting the states
does not change the semantics of the temporal pattern. Therefore, we define
$(s_A, R_A) \equiv (s_B, R_B) :\Leftrightarrow (s_A, R_A) \sqsubseteq (s_B, R_B) \wedge (s_B, R_B) \sqsubseteq (s_A, R_A)$
and consider the factorisation $(TP(S)/\equiv, \sqsubseteq)$, where $\sqsubseteq$ has been generalized
canonically to equivalence classes. Then, $\sqsubseteq$ is also antisymmetric on $TP(S)/\equiv$
and thus a partial order on (equivalence classes of) temporal patterns.

To simplify notation we pick a subset $NTP(S) \subseteq TP(S)$ of normalized temporal
patterns such that $NTP(S)$ contains one element for each equivalence class
of $TP(S)/\equiv$, that is, $(NTP(S), \sqsubseteq)$ is isomorphic to $(TP(S)/\equiv, \sqsubseteq)$. In the
remainder, we will then use $(NTP(S), \sqsubseteq)$ synonymously to $(TP(S)/\equiv, \sqsubseteq)$.
Within each equivalence class, we can order the patterns lexicographically by
initial time, final time, and state. This ordering is unique thanks to (1). We use
the first pattern in this ordering as the representative of the class.



4 Occurrences of Temporal Patterns in State Sequences 

To be considered interesting, a temporal pattern is limited in its extension, that 
is, the whole pattern has to be small enough to be observed by a (forgetful) op- 
erator. We therefore choose a maximum duration tmax, which serves as the width 
of a sliding window which is moved along the state sequences. We consider only 
those pattern instances that can be observed within this window. In a monitor- 
ing and control application, this threshold could be taken from the maximum 
history length that can be displayed on the monitor and thus be inspected by 
the operator. 

We define the total time in which the pattern can be observed within the
sliding window as the support supp(P) of the pattern P. (Space limitations
prohibit the justification of this choice; we refer the interested reader to [9].) Let
us illustrate this definition with some examples in Fig. 3. In subfigure (a) we have
a single state A. We see the pattern for the first time when the right bound of
the sliding window touches the initial time of the state interval (dotted position
of sliding window). We can observe A until the sliding window reaches the
position that is drawn with dashed lines. The total observation time is therefore
the length of the sliding window $t_{max}$ plus the length of state interval A. The
support (observation duration) is depicted at the bottom of the subfigure.

Subfigure (b) shows another example, “A overlaps B”. We can observe an
instance of the pattern as soon as we can see state B, and we lose it when A
leaves the sliding window. If the pattern occurs multiple times, two things may
happen: If there is a gap between the pattern instances, such that we lose the
pattern in the meantime, then the supports of the individual instances add up
to the support of the pattern, as shown in subfigure (c). If there is no such gap
(subfigure (d)), we see the pattern from the moment the first instance enters the
sliding window until the last instance leaves the window. In the meantime, it does not
matter how many instances are present, as long as there is at least one.

Fig. 3. Illustration of our notion of support.

If we divide the support of a pattern by the length of the state sequence plus
the window width $t_{max}$ we obtain the relative frequency $p$ of the pattern: If we
randomly select a window position we can observe the pattern with probability
$p$. Also note that there is no need for discretization: we can handle time continuously
by jumping from interval bound (initial or final time) to interval bound
and integrating the support over the jump period. This is because observability
of a pattern changes only if the sliding window meets one of the interval bounds.
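
The notion of support can be illustrated with a small sketch. It assumes, in line with the two examples of Fig. 3, that a pattern instance is observable from the moment the right window bound reaches the latest initial time of its intervals until the left window bound passes the earliest final time. Instances are given as lists of `(b, f)` pairs; all function names are hypothetical.

```python
from typing import List, Optional, Tuple

Interval = Tuple[float, float]


def observation_interval(instance: List[Interval],
                         t_max: float) -> Optional[Interval]:
    """Positions of the right window bound at which the instance is observable."""
    start = max(b for b, _ in instance)       # last interval has become visible
    end = min(f for _, f in instance) + t_max  # first interval leaves the window
    return (start, end) if start <= end else None  # None: does not fit the window


def union_length(intervals: List[Interval]) -> float:
    """Total length covered by a set of possibly overlapping intervals."""
    total, cur_start, cur_end = 0.0, None, None
    for lo, hi in sorted(intervals):
        if cur_end is None or lo > cur_end:
            if cur_end is not None:
                total += cur_end - cur_start
            cur_start, cur_end = lo, hi
        else:
            cur_end = max(cur_end, hi)
    if cur_end is not None:
        total += cur_end - cur_start
    return total


def support(instances: List[List[Interval]], t_max: float) -> float:
    """Total time during which at least one instance is observable."""
    observed = [observation_interval(inst, t_max) for inst in instances]
    return union_length([o for o in observed if o is not None])


def relative_frequency(supp: float, sequence_length: float, t_max: float) -> float:
    """Probability of observing the pattern at a random window position."""
    return supp / (sequence_length + t_max)
```

For a single state of length 5 and a window of width 5 this gives a support of 10, matching subfigure (a); two instances whose observation periods overlap contribute the length of the union rather than the sum, as in subfigure (d).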

5 Discovery of Temporal Rules 

A pattern is called frequent if its support exceeds a threshold $supp_{min}$. The task
is to find all frequent temporal patterns in $NTP(S)$, from which we then create
the temporal rules. To find all frequent patterns we start in a first database pass
with the estimation of the support of every single state (also called candidate
1-patterns). After the $k$th run, we remove all candidates that have missed the
minimum support and create out of the remaining frequent $k$-patterns a set of
candidate $(k+1)$-patterns whose support will be estimated in the next pass.
This procedure is repeated until no more frequent patterns can be found. The
fact that the support of a pattern is always less than or equal to the support of
any of its subpatterns

$\forall \text{ patterns } P, Q : \quad Q \sqsubseteq P \Rightarrow supp(Q) \ge supp(P) \qquad (2)$

guarantees that we do not miss any frequent patterns. At this level of detail the
procedure is identical to association rule mining [2].
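
At this level of detail the search is a generic level-wise loop, which can be sketched as below. The callables `support_of` and `extend` are placeholders of this example for the support estimation and candidate generation steps; the sketch also assumes that a pattern reaching the threshold exactly counts as frequent.

```python
from typing import Callable, Iterable, List, TypeVar

P = TypeVar("P")  # pattern type, left abstract in this sketch


def levelwise(initial_candidates: Iterable[P],
              support_of: Callable[[P], float],
              extend: Callable[[List[P]], List[P]],
              supp_min: float) -> List[P]:
    """Return all frequent patterns found by repeated candidate passes."""
    frequent: List[P] = []
    candidates = list(initial_candidates)  # candidate 1-patterns
    while candidates:
        # One database pass: keep candidates that reach the minimum support.
        level = [c for c in candidates if support_of(c) >= supp_min]
        frequent.extend(level)
        # Build candidate (k+1)-patterns from the frequent k-patterns.
        candidates = extend(level)
    return frequent
```

The loop terminates as soon as `extend` yields no further candidates, which by (2) happens at the latest when a level contains no frequent pattern.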



5.1 Candidate Generation 

The number of potential candidates grows exponentially with the size k of the 
patterns. Efficient pruning techniques are therefore necessary to keep the increase 
in the number of candidates moderate. We use three different pruning techniques. 

The technique that is used for the discovery of association rules [2] can still
be applied to temporal patterns: Due to (2), every $k$-subpattern of a
$(k+1)$-candidate must be frequent, otherwise the candidate itself cannot be frequent.
To enumerate as few non-candidate $(k+1)$-patterns as possible, we join any two
frequent $k$-patterns $P$ and $Q$ that share a common $(k-1)$-pattern as a prefix.
Let us denote the remaining states in $P$ and $Q$ besides those in the prefix as
$p$ and $q$ respectively. We denote the interval relationship between $p$ and $q$ in
the candidate pattern $X = (s_X, R_X)$ as $R_X[k, k+1] = r$. Figure 4 illustrates
how to build the $(k+1)$-pattern matrix $R_X$ out of $R_P$ and $R_Q$. Since $R_P$ and
$R_Q$ are identical with respect to the first $k-1$ states in normalized form, the
same is true for the new pattern $X$ (indicated by the same submatrix A). The
relationships of $p$ and of $q$ to the first $k-1$ states can also be taken from
$R_P$ and $R_Q$. Thus, as we can see in Fig. 4(c), the only degree of freedom is $r$.
From the $(k-1)$-pattern prefix and the two states $p$ and $q$ we thus can build up
a $(k+1)$-pattern which is completely specified up to the relation between $p$ and
$q$.

[Figure panels: (a) Pattern P, (b) Pattern Q, (c) New pattern X. In (c), the
relation matrix of X consists of the common prefix submatrix A for the first
$k-1$ states, the rows and columns for $p$ and $q$ taken from $R_P$ and $R_Q$, and the
free entry $r$ between $p$ and $q$ together with its inverse.]

Fig. 4. Generating a candidate $(k+1)$-pattern X out of two $k$-patterns P and Q that
are identical when restricted to the first $k-1$ states.

The freedom in choosing $r$ yields 13 different patterns that might become
candidate $(k+1)$-patterns, because there are 13 possible interval relationships.
Since we can restrict ourselves without loss of generality to normalized patterns,
the number of possible values for $r$ reduces to a maximal number of 7. Before
we check each of the seven $(k+1)$-patterns for frequent $k$-subpatterns, we apply
another pruning technique based on the law of transitivity. For example, the
two 2-patterns “A meets B” and “A meets C” share the primitive 1-pattern “A”
as a common prefix. We have to fix the missing relationship between B and C
to obtain a 3-candidate. The law of transitivity for interval relations [3] tells
us that the possible set of interval relations is {is-started-by, equals, starts}. In
normalized form, only 2 out of 7 possible relationships remain. In general, for
each state $s(i)$ of the first $k-1$ states we apply Allen’s transitivity table to the
relationship between $p$ and $s(i)$ ($R_P[k,i]$) and $s(i)$ and $q$ ($R_Q[i,k]$). Only those
values for $r$ that do not contradict the results of the $k-1$ applications of the
transitivity table yield a candidate pattern.
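
The effect of the transitivity law in the example above can be reproduced without Allen's table by brute force: since three intervals have only six endpoints, enumerating all intervals over six integer time points covers every qualitative configuration. This enumeration is a stand-in chosen for illustration, not the table lookup used by the algorithm, and the function names are hypothetical.

```python
from itertools import combinations, product
from typing import Set, Tuple

Interval = Tuple[int, int]


def allen_relation(a: Interval, b: Interval) -> str:
    """Return the Allen relation that holds between interval a and interval b."""
    (ab, af), (bb, bf) = a, b
    if af < bb:
        return "before"
    if bf < ab:
        return "after"
    if af == bb:
        return "meets"
    if bf == ab:
        return "is-met-by"
    if ab == bb and af == bf:
        return "equals"
    if ab == bb:
        return "starts" if af < bf else "is-started-by"
    if af == bf:
        return "finishes" if ab > bb else "is-finished-by"
    if bb < ab and af < bf:
        return "during"
    if ab < bb and bf < af:
        return "contains"
    return "overlaps" if ab < bb else "is-overlapped-by"


def possible_relations(rel_ab: str, rel_ac: str) -> Set[str]:
    """Relations between B and C compatible with rel(A,B) and rel(A,C)."""
    intervals = list(combinations(range(6), 2))  # all (b, f) with b < f
    found = set()
    for a, b, c in product(intervals, repeat=3):
        if allen_relation(a, b) == rel_ab and allen_relation(a, c) == rel_ac:
            found.add(allen_relation(b, c))
    return found
```

Calling `possible_relations("meets", "meets")` yields the three relations is-started-by, equals and starts named in the text, so the remaining ten values of $r$ need not be considered for this pair.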

Finally, for every temporal pattern $Q$ we maintain an observed and expected
support set $O_Q$ and $E_Q$, resp. The set $O_Q$ contains all points in time that
contribute to the support of the pattern $Q$, that is, all points in time in which
the pattern can be observed in the sliding window. Before we consider a
$(k+1)$-pattern $P$ as a candidate pattern, we intersect³ all sets $O_Q$ of all $k$-subpatterns
$Q$ of $P$. The result gives us the expected support of $P$ in $E_P$. The cardinality
of $E_P$ serves as a tighter upper bound of the support of $P$ than
$\min\{supp(Q) \mid Q \sqsubseteq P, \dim(Q) = k\}$ does. If it stays below $supp_{min}$ the pattern
cannot become a frequent pattern, therefore we do not consider it as a candidate.
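
The third pruning test can be sketched with support sets stored as sorted lists of disjoint `(lo, hi)` intervals. The representation and the function names are assumptions of this example.

```python
from typing import List, Tuple

Interval = Tuple[float, float]


def intersect(xs: List[Interval], ys: List[Interval]) -> List[Interval]:
    """Intersect two sorted lists of disjoint intervals."""
    out, i, j = [], 0, 0
    while i < len(xs) and j < len(ys):
        lo = max(xs[i][0], ys[j][0])
        hi = min(xs[i][1], ys[j][1])
        if lo < hi:
            out.append((lo, hi))
        # Advance the list whose current interval ends first.
        if xs[i][1] < ys[j][1]:
            i += 1
        else:
            j += 1
    return out


def cardinality(intervals: List[Interval]) -> float:
    """Add up the interval lengths."""
    return sum(hi - lo for lo, hi in intervals)


def expected_support(observed_sets: List[List[Interval]]) -> List[Interval]:
    """E_P: intersection of the observed support sets of all k-subpatterns."""
    result = observed_sets[0]
    for other in observed_sets[1:]:
        result = intersect(result, other)
    return result


def is_candidate(observed_sets: List[List[Interval]], supp_min: float) -> bool:
    """Keep the (k+1)-pattern only if card(E_P) does not stay below supp_min."""
    return cardinality(expected_support(observed_sets)) >= supp_min
```

As a usage example, for the subpattern support sets `[(0, 10), (20, 30)]`, `[(5, 25)]` and `[(0, 8), (22, 40)]` the intersection is `[(5, 8), (22, 25)]` with cardinality 6, whereas the minimum of the individual supports is 20; a threshold of 10 therefore prunes the pattern only under the tighter bound.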



5.2 Support Estimation 

Again, due to space limitations we can give only a quick overview of the basic
ideas; a more detailed report is currently in preparation (contact the author).

In order to estimate the support for the candidate patterns, we sweep through
the state sequence and incrementally update the list of states which are currently
visible in the sliding window. We also update the relation matrix for the states
in the sliding window incrementally. By $t_{act}$ we denote the right bound of the
sliding window.

The set of candidate patterns is partitioned into three subsets, which we call
the set of passive, active, and potential candidates. The set of passive candidates
contains those candidates $P$ that we do not expect in the current sliding window
because the expected support does not contain the time of the current window
position, that is, $t_{act} \notin E_P$. The set of potential candidates contains those
candidates for which we have $t_{act} \in E_P$, that is, there is a chance of observing $P$ in
the window. Finally, the set of active patterns contains those patterns that are
currently observable in the sliding window.

At the beginning all patterns are passive patterns. Associated with every
pattern we have the set of expected support $E_P$; we therefore know in advance
when the pattern will become a potential pattern, namely at activation time
$a_P = \min\{t \mid t \in E_P\}$. If the set $E_P$ is organized as a sorted list of intervals, the
minimum is simply the left bound of the first interval in the list. We keep the set
of passive patterns ordered by their activation time. Whenever $t_{act}$ reaches the
activation time of a pattern $P$, $P$ becomes either a potential or active pattern,
depending on whether $P$ occurs in the sliding window or not. When $P$ becomes
a potential pattern, we remove the leading interval from the $E_P$ list and store
the deactivation time $d_P$ (end of the interval), because at that time the pattern
will fall back into the set of passive patterns.

³ The sets $O_Q$ and $E_Q$ can be organized as lists of intervals. The intersection is also a
list of intervals. We only have to add up the interval lengths to obtain the cardinality.

A potential pattern $P$ becomes a passive pattern if the fall-back time $d_P$
has been reached by the sliding window. Whenever a new state interval enters
the sliding window, we check for all potential patterns if an instance of the
pattern can be found. (Since the set of potential patterns may become quite
large, this is the most expensive operation.) If this is the case, the potential
pattern becomes an active pattern, otherwise we keep it as a potential pattern.
If a pattern instance has been found, we calculate the point in time when the
pattern disappears and use it as the fall-back time for the active pattern.

Just like the set of passive patterns, the set of active patterns is sorted by their
fall-back times. Whenever $t_{act}$ reaches the fall-back time of an active pattern,
we check whether a new pattern instance has entered the sliding window in the
meantime. In this case the pattern remains an active pattern, but we update
the fall-back time. Otherwise, depending on whether $d_P < t_{act}$ or not, the active
pattern becomes a potential or passive pattern.

Whenever a pattern instance has been found, the support of the pattern
is updated incrementally, that is, we insert the period of pattern observation
(the support) into $O_P$. Since we have an upper bound of the remaining support
(namely the cardinality of the continuously updated set $E_P$), we can perform
a fourth online pruning test. If the support achieved so far ($card(O_P)$) plus
the maximally remaining support ($card(E_P)$) drops below $supp_{min}$ we do not
consider the pattern any longer. At the end of each database pass, the set $E_P$
is empty and $O_P$ contains the support of $P$, which is then subsequently used in
the next candidate generation step for pruning.



5.3 Rule Generation 

After having determined all frequent temporal patterns, we can construct rules
$X \to Y$ from every pair $(X, Y)$ of frequent temporal patterns with $X \sqsubseteq Y$. We
restrict ourselves to “forward rules”, that is, rules that make conclusions in the
future rather than in the past. If the confidence of the rule,
$conf(X \to Y) = supp(Y) / supp(X)$, is greater than the minimal confidence, the rule
is printed. Enumeration of all possible rules can be done efficiently using
techniques described in [2].

5.4 Disjunctive Combination of Temporal Patterns 

When analysing the rules obtained by the algorithm, we must keep in mind
that we were searching for simple interval relationships only, that is, those
relationships that consist of a single attribute $r \in I$. If a process B is started
some time after A has started, then this can result in a number of rules “A → B”
with temporal relationships overlaps, meets, and before. The confidence of the
true relationship (which is in this case: A overlaps/meets/before B) might be
very high, but the confidence values we observe for the three rules we have found
are comparatively low. We are not allowed to add up the confidence values of
all three rules in order to obtain the confidence of the composed rule. This
would lead to an overestimation, because there might be sliding windows that
contain several of these patterns simultaneously, and in this case we would
count them twice (or more). Fortunately, it is possible to calculate the support
of composed rules afterwards. The support of a pattern $P$ which is a disjunction
of two patterns $Q$ and $R$ can be calculated easily as $supp(P) = card(O_Q \cup O_R)$.
The sets of observed support $O_Q$ and $O_R$ have been calculated already during
the execution of the algorithm; all we have to do is to store the sets for later
access. (Note that we cannot guarantee that we will find all frequent pattern
compositions in this way. Several patterns that do not reach $supp_{min}$ individually
might fulfil this requirement after their combination.)
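
The difference between adding supports and taking the union of the observed support sets can be seen in a short sketch. The interval-list representation and the function names are assumptions of this example, and the numbers are made up for illustration.

```python
from typing import List, Tuple

Interval = Tuple[float, float]


def union(xs: List[Interval], ys: List[Interval]) -> List[Interval]:
    """Union of two interval lists, returned as sorted disjoint intervals."""
    merged: List[Interval] = []
    for lo, hi in sorted(list(xs) + list(ys)):
        if merged and lo <= merged[-1][1]:
            # Overlaps or touches the previous interval: extend it.
            merged[-1] = (merged[-1][0], max(merged[-1][1], hi))
        else:
            merged.append((lo, hi))
    return merged


def cardinality(intervals: List[Interval]) -> float:
    """Add up the interval lengths."""
    return sum(hi - lo for lo, hi in intervals)


# Hypothetical observed support sets of two patterns Q and R.
O_Q = [(0, 10), (20, 25)]
O_R = [(5, 12), (30, 31)]

supp_disjunction = cardinality(union(O_Q, O_R))  # supp(P) = card(O_Q ∪ O_R)
naive_sum = cardinality(O_Q) + cardinality(O_R)  # counts (5, 10) twice
```

Here the union covers 18 time units while the naive sum gives 23, because the window positions in which both patterns are visible are counted twice.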



6 Evaluation and Discussion 

We have examined air-pressure and wind strength/wind direction data from a
small island in the North Sea. From the time stamps we have also extracted
the season. It is well known that local differences in air pressure are the cause
of wind; therefore we should find some relationships between these variables.
Although global weather forecasting is (more or less) done perfectly by large-scale
weather simulations, it is still not possible to localize precisely where a certain
weather phenomenon will occur, to what extent, and at what time. Rules about the
qualitative behaviour of the air pressure curve indeed help sailors in short-term

Fig. 5. Extracted features from time series: wind strength, Helgoland, April 1997.

The data has been measured hourly and we used 3, 6, and 9 years of data
from 1991-1999 to test the algorithm. We have applied kernel smoothing in
order to compensate for noise and to get more robust estimates of the first
and second derivative. Then, the smoothed series have been partitioned into
primitive patterns like “increasing”, “concave”, “high-value”, etc. See Fig. 5 for
an example.

Table 1 shows the performance of the pattern mining algorithm with different
average state densities, window widths, and state series lengths. The threshold
$supp_{min}$ has been chosen to be 2% of the data period in all runs. The computation
times ranged from a few seconds to 20 minutes on a 550 MHz Pentium III
processor with 256 MB main memory. We can see that the pruning techniques
were quite efficient: apart from a few exceptions, only 1-3% of all processed patterns
became candidate patterns. The artificially generated data set has a rich pattern
structure; on average 45% of all candidates became frequent patterns. This
value increases if we consider only runs with large window widths. If the state
density D (average number of state intervals visible in the sliding window) is
fixed, the run time is roughly linear in the size of the state series.



     (a)       |    (b) years ’97-’99    |    (c) years ’94-’99    |    (d) years ’91-’99
  S   W      D |      F   F/C   C/P     T |      F   F/C   C/P     T |      F   F/C   C/P     T
  8  18   3.42 |    191  30.3   5.1  1.19 |    178  29.4   5.0  1.31 |     28   7.1  53.1  0.56
  8  30   5.70 |  1,126  56.2   3.4  2.95 |  1,055  54.9   3.5  3.03 |     96  20.9   8.2  0.65
  8  42   7.98 |  4,904  78.3   2.3  8.27 |  4,459  78.1   2.3  7.97 |    249  27.3   7.2  1.11
 15  18   5.40 |  1,071  55.8   2.7  2.06 |    471  25.0   1.6  2.19 |    829  46.7   1.8  2.01
 15  30   9.00 |  2,779  42.6   2.4  5.24 |  2,618  41.6   2.5  5.38 |  2,024  37.1   2.8  4.67
 15  42   12.6 | 12,900  67.3   1.3  16.0 | 11,986  66.4   1.4  15.7 |  9,618  63.3   1.5  13.4
 27  18   8.28 |  1,600  25.2   2.1  4.27 |  1,562  24.9   2.1  5.34 |  1,359  23.0   2.3  4.08
 27  30   13.8 |  9,767  42.7   1.8  12.4 |  9,184  41.6   1.8  14.5 |  7,082  37.8   2.0  10.1
 27  42   19.3 | 48,832  65.3   0.7  43.3 | 45,302  64.2   1.0  49.0 | 34,872  60.9   1.1  32.7

Table 1. Results of the algorithm. In all experiments the threshold $supp_{min}$ has been
chosen to be 2% of the time series length (3, 6, and 9 years). Column S denotes the
number of distinct states in the series, column W denotes the window width (hours),
column D the average state series density (average number of states visible in the
window). Column F contains the number of frequent patterns, F/C the percentage
of frequent patterns among candidate patterns, and C/P the percentage of candidate
patterns among processed patterns (that is, candidate and pruned patterns). Column
T shows the run time per 1000 state intervals in the series.



Due to the complexity of the temporal patterns, matching a k-pattern against
the sliding window is O(D^). Therefore, the complexity of the analysis depends 
on all parameters that influence D, e.g. sliding window width, number and length 
of intervals generated from a time series, size of the set of labels, etc. Further- 
more, if the sliding window content changes quickly, we have to check more 
frequently if potential candidates become active candidates. Another point is 
the number of “uninteresting” associations that are generated by the interval
extraction: if the state series represents local trends extracted from time series, it
is natural that we observe many frequent patterns like “increasing segment before
decreasing segment” or “concave before convex segment”, and vice versa.
These uninteresting frequent patterns can be combined arbitrarily into patterns with
more than 2 states and have considerable impact on the number of frequent
patterns (and thus on run time).
Fig. 6. Some exemplary rules. The bars indicate the temporal relationship between
the intervals; their length has been chosen arbitrarily. The label in the bar describes a
condition that holds in the interval (where grd denotes gradient, crv curvature, etc.).

Due to lack of space, we present only a few exemplary rules. We have generated
only those rules whose conclusion lies in the future (with respect to the intervals
in the premise). Among them, there were many rules predicting a high gradient
in wind strength. Fig. 6(a) shows one of them: if a period of strongly increasing or
decreasing air pressure overlaps a period of high curvature, it is very likely that
the wind strength will change quickly (with a high gradient). The depicted rule
also occurs with during and meets relationships between the air pressure states,
so a disjunctive combination as described in Sect. 5.4 is promising. Figure 6(b)
is an example of a rule that infers a strong change in wind strength from a
change in wind direction. The rule in Fig. 6(c) tells us that stable weather
(nearly constant air pressure) is likely to continue in summer, that is, a
constant air pressure segment is followed by another constant air pressure period
with low winds. Similar rules can also be found for other seasons, but with
much lower confidence values.

On average, the confidence values of the rules are comparatively low
(about 40-60% for the examples). This is because simple patterns (used in the
premise) can be observed longer than complex patterns (patterns comprising
premise and conclusion). To illustrate this, review Fig. 3(a)-(b), where the
support of pattern “A” is greater than the support of pattern “A overlaps B”,
although A has the same length in both cases. This leads to confidence
values below 1 even if every A overlaps a B in the examined state series. We
investigate other measures for rule evaluation in [9].

7 Conclusion 

We have proposed a technique for the discovery of temporal rules in state se- 
quences, which might stem from multivariate time series for instance. The ex- 
amples in Sect. 6 have shown that the proposed method is capable of finding 
meaningful rules that can be used as rules-of-thumb by a human, but also in 
a knowledge-based expert system. The rules can be easily interpreted by a do- 
main expert, who can verify the rules or use them as an inspiration for further 
investigation. Even if there is already considerable background knowledge, the
application of this method might be valuable if the known rules incorporate 
more variables than readily available. For instance, weather forecasting rules 
as discussed by Karnetzki [10] also use information about the general weather 
outlook (cloudiness) or information from the local weather forecasting station. 
Such information might be difficult to incorporate or expensive to measure, and 
in such a case one is interested in how much one can achieve by just using the 
available variables. Selection of the best rules gets further treatment in [9]. 
References

[1] R. Agrawal, K.-I. Lin, H. S. Sawhney, and K. Shim. Fast similarity search in
the presence of noise, scaling, and translation in time-series databases. In Proc.
of the 21st Int. Conf. on Very Large Databases, 1995.

[2] R. Agrawal, H. Mannila, R. Srikant, H. Toivonen, and A. I. Verkamo. Fast dis- 
covery of association rules. In [8], chapter 12, pages 307-328. MIT Press, 1996. 

[3] J. F. Allen. Maintaining knowledge about temporal intervals. Comm. ACM,
26(11):832-843, 1983.

[4] B. R. Bakshi and G. Stephanopoulos. Reasoning in time: Modeling, analysis, 
and pattern recognition of temporal process trends. In Advances in Chemical 
Engineering, volume 22, pages 485-548. Academic Press, Inc., 1995. 

[5] D. J. Berndt and J. Clifford. Finding patterns in time series: A dynamic program- 
ming approach. In [8], chapter 9, pages 229-248. MIT Press, 1996. 

[6] A. C. Capelo, L. Ironi, and S. Tentoni. Automated mathematical modeling from
experimental data: An application to material science. IEEE Trans. on Systems,
Man, and Cybernetics, Part C, 28(3):356-370, Aug. 1998.

[7] G. Das, K.-I. Lin, H. Mannila, G. Renganathan, and P. Smyth. Rule discovery 
from time series. In Proc. of the 4th Int. Conf. on Knowledge Discovery and Data 
Mining, pages 16-22. AAAI Press, 1998. 

[8] U. M. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy, editors. Ad- 
vances in Knowledge Discovery and Data Mining. MIT Press, 1996. 

[9] F. Höppner and F. Klawonn. Finding informative rules in interval sequences. In
Proc. of the 4th Int. Symp. on Intelligent Data Analysis, Lisbon, Portugal, Sept.
2001. Springer.

[10] D. Karnetzki. Luftdruck und Wetter. Delius Klasing, 3rd edition, 1999.

[11] K. B. Konstantinov and T. Yoshida. Real-time qualitative analysis of the temporal 
shapes of (bio)process variables. Artificial Intelligence in Chemistry, 38(11):1703- 
1715, Nov. 1992. 

[12] B. Kuipers. Qualitative Reasoning - Modeling and Simulation with Incomplete
Knowledge. MIT Press, 1994.

[13] H. Mannila, H. Toivonen, and A. I. Verkamo. Discovery of frequent episodes in 
event sequences. Technical Report 15, University of Helsinki, Finland, Feb. 1997. 

[14] S. A. McIlraith. Qualitative data modeling: application of a mechanism for
interpreting graphical data. Computational Intelligence, 5:111-120, 1989.

[15] R. Srikant and R. Agrawal. Mining sequential patterns: Generalizations and
performance improvements. In Proc. of the 5th Int. Conf. on Extending Database
Technology, Avignon, France, Mar. 1996.




Temporal Rule Discovery for Time-Series 
Satellite Images and Integration with RDB 



Rie Honda and Osamu Konishi



Department of Mathematics and Information Science
Kochi University, Akebono-cyo 2-5-1 Kochi, 780-8520, JAPAN
{honda,konishi}@is.kochi-u.ac.jp
http://www.is.kochi-u.ac.jp



Abstract. Feature extraction and knowledge discovery from large
amounts of image data, such as remote sensing images, have been in high
demand in recent years. In this study, we present a framework for data mining
from a set of time-series images that include moving objects. Time-series
images are transformed into time-series cluster addresses by
clustering with a two-stage SOM (self-organizing map), and time-dependent
association rules are extracted from them. Semantically indexed data and
extracted rules are stored in an object-relational database, which
allows high-level queries to be entered in SQL through the user interface. The
method was applied to weather satellite cloud images taken by GMS-5,
and its usefulness was evaluated.



1 Introduction 

A huge amount of data has been stored in databases in the areas of business and
science. Data mining, or knowledge discovery from databases (KDD), is a method
for extracting unknown information such as rules and patterns from a large-scale
database. Well-known data mining methods include decision trees, association
rules, classification, clustering, and time-series analysis, and there are
some successful application studies for astronomical images such as SKICAT
and JARtool.

In our recent studies, we have applied data mining methods such as
clustering and association rules to a large number of time-series satellite weather
images over the Japanese islands. Meteorological events are considered to be
chaotic phenomena in that an object such as a mass of cloud changes its position
and form continuously, and are thus appropriate for experiments in spatio-temporal
data mining.

Features of our studies applied to the weather images are summarized as 
follows: application of data mining method to image classification and retrieval, 
feature description from time-series data, implementation of the result of classi- 
fication as the user retrieval interface, and construction of the whole system as 
a domain-expert KDD supporting system. 

We give an overview of the system in Sect. 2. A clustering algorithm
for time-sequential images and its experimental results are described in Sect. 3.
Section 4 describes the algorithm for extracting time-dependent association
rules and its experimental results. Section 5 describes the construction
of the database using an R-tree and the results of its implementation. Section 6
provides a conclusion.



2 System Overview 

We constructed a weather image database that gathers the sequential changes of
cloud images, aiming to build a domain-expert analysis support system
for these images. The flow of this system is shown in Fig. 1 and described as
follows:

1 Clustering of frame images using a self-organizing map. 

2 Transformation of time-series images into cluster address sequence. 

3 Extraction of events and time-dependent rules from the time-sequential 
cluster addresses. 

4 Indexing of events and rules by R-tree, and integration with the 
database. 

5 Searching for time-sequential variation patterns and browsing of the 
retrieved data in the form of animation. 

The framework described above enables us to characterize, semi-automatically,
an enormous number of images acquired at regular time intervals, and to retrieve
the images by using the extracted rules. For example, this framework enables
queries like “search for frequent events that occur between one typhoon and
the next typhoon”, or “search for a weather change such that a typhoon occurs
within 10 days after a front and high pressure mass developed within the time
interval of 5 days”.



3 Time-Sequential Data Description by Using Clustering 

3.1 Data Set Description 

Satellite weather images, taken by GMS-5 and received at the Institute of
Industrial Science of Tokyo University, are archived at the Kochi University weather
page (http://weather.is.kochi-u.ac.jp). In this study, we used infrared band (IR1)
images around the Japanese islands, which reflect the cloud distribution very well.
Each image is 640 pixels wide and 480 pixels high. An image is
taken every hour, and about 9000 images are archived every year.

We considered that conventional image processing methods might be unable 
to detect moving objects such as the cloud masses that change their position as 
time proceeds. Thus we used the following SOM-based method for the automatic 
clustering of images by taking the raster image intensity vectors as the inputs. 
Fig. 1. Overview of the system.



3.2 Clustering and Kohonen’s Self-Organizing Map 

Kohonen’s self-organizing map (SOM) is a paradigm that was suggested
in 1990 and has been widely used to impose a rough structure on
unstructured information. The SOM is a two-layer network, composed of an
input layer and a competition layer, that is trained through
iterative unsupervised learning.

Each unit of the input layer and the competition layer has a vector whose
components correspond to the input pattern elements. The algorithm of the
SOM is described as follows. Let the input pattern vector be $V \in R^n$, $V =
[v_1, v_2, v_3, \ldots, v_n]$, and let the weight of the connection from the input
vector to a unit i be $U_i = [u_{i1}, u_{i2}, u_{i3}, \ldots, u_{in}]$. The initial
values of $u_{ij}$ are given randomly. V is compared with all $U_i$, and the best
matching unit, which has the smallest Euclidean distance, is determined and
denoted by the subscript c:

$$c = \arg\min_i \| V - U_i \|. \qquad (1)$$



Weight vectors of the unit c and its neighbors $N_c$, which is the area of N x N
units around the unit c, are adjusted to increase the similarity as follows,

$$u_{ij}^{new} = \begin{cases} u_{ij}^{old} + \alpha(t)\,(v_j - u_{ij}^{old}) & (i \in N_c) \\ u_{ij}^{old} & (i \notin N_c) \end{cases} \qquad (2)$$

where

$$\alpha(t) = \alpha_0 (1 - t/T), \qquad (3)$$

$$N(t) = N_0 (1 - t/T). \qquad (4)$$

Here $\alpha(t)$ and $N(t)$ are the learning rate and the size of the neighborhood
at iteration t, respectively, $\alpha_0$ and $N_0$ are the initial learning rate and the
initial size of the neighborhood, and T is the total number of learning iterations.
The learning rate and the size of the neighborhood decrease as the learning proceeds,
so that the map stabilizes.

Fig. 2. Problem for clustering of weather images.

The input signals V are classified into the activated (closest) unit Uc and 
projected onto the competition grids. The distance on the competition grids 
reflects the similarity between the patterns. After the training is completed, the 
obtained competition grids represent a natural relationship between the patterns 
of input signals entered into the network. Hereafter we refer to the competition
grid obtained after learning as the feature map.
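The update rule in Eqs. (1)-(4) can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function names, the default parameter values, and the reading of the N x N neighborhood as a square of half-width N(t)/2 (rounded down) centred on the winning unit are all assumptions made for the example.

```python
import math
import random


def best_matching_unit(weights, v):
    """Index c of the unit whose weight vector is closest to v (Eq. 1)."""
    return min(range(len(weights)), key=lambda i: math.dist(weights[i], v))


def som_step(weights, coords, v, alpha, half_width):
    """One update (Eq. 2): pull the winner and its square neighborhood toward v."""
    c = best_matching_unit(weights, v)
    c_row, c_col = coords[c]
    for i, (row, col) in enumerate(coords):
        if max(abs(row - c_row), abs(col - c_col)) <= half_width:
            weights[i] = [u + alpha * (x - u) for u, x in zip(weights[i], v)]
    return c


def train_som(data, rows=4, cols=4, alpha0=0.5, n0=3, iterations=1000, seed=0):
    """Train a rows x cols map on a list of equal-length input vectors."""
    rng = random.Random(seed)
    dim = len(data[0])
    coords = [(r, c) for r in range(rows) for c in range(cols)]  # raster-scan order
    weights = [[rng.random() for _ in range(dim)] for _ in coords]
    for t in range(iterations):
        decay = 1.0 - t / iterations
        alpha = alpha0 * decay        # Eq. (3)
        n_t = n0 * decay              # Eq. (4)
        half_width = int(n_t) // 2    # assumption: N x N square centred on c
        som_step(weights, coords, rng.choice(data), alpha, half_width)
    return weights, coords
```

After training, `best_matching_unit(weights, v)` gives the position of an input vector on the feature map; with the raster-scan ordering used here, index 0 is the upper left unit.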

3.3 Clustering by Two-Stage SOM 

Figure 2 illustrates the problem of clustering weather images. The two images in
Fig. 2(A) are considered to have the same features, a typhoon and a front,
although their forms and positions change as time proceeds. When we simply take
the raster image intensity vectors as the input vectors, these images are
classified into different groups because of the spatial variations of intensity.
We considered that this difficulty can be avoided by dividing the images into blocks,
as shown in Fig. 2(B).

The procedure adopted here, named two-stage SOM, is shown schematically in
Fig. 3 and described as follows:

stage 1 Clustering of pattern cells.

step 1 All images are divided into N x M blocks.

step 2 Learning by SOM is performed by entering each block’s raster
image intensity vector as the input vector successively.

step 3 Each block of the original images is projected onto the first SOM
feature map and characterized by the closest unit address. We refer
to these characterized blocks as the pattern cells.

stage 2 Clustering of the images by using frequency histograms of
pattern cells.

step 1 Each image is transformed into the frequency histogram of the
pattern cells.

step 2 The feature map of the SOM is learned by entering each image’s pattern
cell frequency histogram as the input vector.

step 3 Each image is projected onto the second SOM feature map and
characterized by the closest unit address.

Fig. 3. Clustering of weather images by SOM.

Although the information on the spatial distribution of patterns is lost by
transforming images into frequency histograms of pattern cells (in step 1 of stage 2),
this enables flexible classification: time-series images that have similar objects at
different positions, as shown in Fig. 2(B), are classified as the same type of image.

Hereafter we refer to each unit as a cluster, and express the cluster addresses
by the characters A, B, C, ..., P in raster-scan order from the upper left
corner to the lower right corner of the 4 x 4 feature map.
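The transformation in step 1 of stage 2 can be illustrated with a short Python sketch. It assumes that the first-stage SOM has already been trained and is available as a plain list of prototype vectors; the function names and the tiny 4 x 4 test images are hypothetical and only serve to show why the histogram is insensitive to the position of an object.

```python
def split_into_blocks(image, n_rows, n_cols):
    """Cut a 2-D intensity array into n_rows x n_cols blocks, each flattened in raster order."""
    block_h = len(image) // n_rows
    block_w = len(image[0]) // n_cols
    blocks = []
    for br in range(n_rows):
        for bc in range(n_cols):
            blocks.append([
                image[r][c]
                for r in range(br * block_h, (br + 1) * block_h)
                for c in range(bc * block_w, (bc + 1) * block_w)
            ])
    return blocks


def nearest_unit(prototypes, vector):
    """Address of the closest first-stage unit (squared Euclidean distance)."""
    def sq_dist(p):
        return sum((a - b) ** 2 for a, b in zip(p, vector))
    return min(range(len(prototypes)), key=lambda i: sq_dist(prototypes[i]))


def pattern_cell_histogram(image, prototypes, n_rows, n_cols):
    """Count how often each first-stage unit address occurs among the image's blocks."""
    hist = [0] * len(prototypes)
    for block in split_into_blocks(image, n_rows, n_cols):
        hist[nearest_unit(prototypes, block)] += 1
    return hist


def cluster_letter(unit_index):
    """Raster-scan address on a 4 x 4 map: 0 -> 'A', ..., 15 -> 'P'."""
    return "ABCDEFGHIJKLMNOP"[unit_index]


# Two toy images with the same bright object at different positions.
prototypes = [[0, 0, 0, 0], [9, 9, 9, 9]]          # hypothetical "dark" and "bright" cells
image_a = [[9, 9, 0, 0], [9, 9, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0]]
image_b = [[0, 0, 0, 0], [0, 0, 0, 0], [0, 0, 9, 9], [0, 0, 9, 9]]
hist_a = pattern_cell_histogram(image_a, prototypes, 2, 2)
hist_b = pattern_cell_histogram(image_b, prototypes, 2, 2)
```

Both toy images produce the same histogram, so a second-stage SOM trained on histograms would place them in the same cluster even though their raster intensity vectors differ.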

3.4 Result of Experiments on Clustering 

In the experiments, we sampled GMS-5 IR1 images at 8-hour intervals
between 1997 and 1998, and composed two data sets for 1997 and 1998,
which include 1044 and 966 images, respectively. We set the number of blocks
for each image to 12 x 16, considering the typical size of cloud masses. The
sizes of the feature maps of both the first-stage SOM and the second-stage SOM
were set to 4 x 4, determined by trial and error. The learning process
was iterated 8000-10000 times.

The result of the experiment shows that images including similar features are
distributed into similar clusters. We describe the clusters semantically by specifying
the season in which each cluster is observed, based on the monthly frequency of
each cluster, and by describing the representative objects, such as a front
or a typhoon, identified by visual inspection of the images in the cluster from a
domain expert's point of view. Table 1 shows the semantic descriptions of the
clusters for 1997 and 1998. The descriptions for 1998 differ from those for
1997 because we performed the SOM learning for the two data sets independently.
However, most of the groups are observed in both maps, so the obtained result
is meaningful from the viewpoint of domain-expert knowledge.

Table 1. Semantical description of each cluster

cluster address
1997    1998     season           prominent characteristics
A       A,F,O    spring, summer   front, typhoon
J       H        spring, summer   rainy season’s front, typhoon
N       -        spring, summer   high pressure, typhoon
B,C     -        spring, autumn   high pressure in the west and low pressure in the east
D,H     E        spring, autumn   band-like high-pressure
F       B        spring, autumn   front
P       B        spring, autumn   migratory anticyclone
I       J        summer           Pacific high pressure, front
M       -        summer           Pacific high pressure, typhoon
-       C        summer           Pacific high-pressure
E       D        autumn           migratory anticyclone
G       P        autumn, winter   linear clouds
K,L     L,M      winter           winter type, whirl-like or linear cloud
O       I,K,N    winter           cold front

Table 2. Accuracies of clustering

year   Recall             Precision
1997   86.0% (876/1022)   84.6% (876/1044)
1998   86.7% (838/945)    86.7% (838/966)

To evaluate the accuracy of clustering quantitatively, we defined the following
parameters:

$$\mathrm{Recall} = A / (A + B), \qquad \mathrm{Precision} = A / (A + C), \qquad (5)$$

where A is the number of relevant images classified into the cluster, B is
the number of relevant images classified into other clusters, and C is the
number of nonrelevant images classified into the cluster. The relevance of images
is evaluated by classifying the images visually.
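As a small worked example of Eq. (5), the following Python sketch counts A, B, and C for one cluster from two label lists and derives recall and precision. The function names and the five toy labels are hypothetical; `reference` stands for the visually assigned labels and `assigned` for the cluster addresses produced by the SOM.

```python
def counts_for_cluster(assigned, reference, cluster):
    """Return (A, B, C) of Eq. (5) for one cluster."""
    pairs = list(zip(assigned, reference))
    a = sum(1 for got, want in pairs if got == cluster and want == cluster)
    b = sum(1 for got, want in pairs if got != cluster and want == cluster)
    c = sum(1 for got, want in pairs if got == cluster and want != cluster)
    return a, b, c


def recall_precision(a, b, c):
    """Recall = A / (A + B), Precision = A / (A + C)."""
    return a / (a + b), a / (a + c)


assigned = ["A", "A", "B", "A", "B"]     # toy SOM output
reference = ["A", "B", "B", "A", "A"]    # toy visual classification
a, b, c = counts_for_cluster(assigned, reference, "A")
recall, precision = recall_precision(a, b, c)
```

For the toy labels, two images are correctly placed in cluster A, one relevant image is missed, and one nonrelevant image is included, so both measures equal 2/3.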

Table 2 shows that the values of recall for 1997 and 1998 are 86.0% and 86.7%,
respectively, and that the values of precision are 84.6% and 86.7%, respectively.
These values indicate that the two-stage SOM can successfully learn the features of
weather images and classify them with high accuracy.

Fig. 4. Example of description of cluster sequence, event sequence, and extraction of
time-dependent association rules.

4 Time-Sequential Analysis and Extraction of 
Time-Dependent Association Rules 

4.1 Time-Dependent Association Rule 

In this study we extract time-dependent association rules such as “weather
pattern B occurs after weather pattern A”, which modify the episode rules
by using the concept of cohesion to evaluate their significance.

First we express the sequence of weather patterns by (A, 1), (A, 2), (C, 3), ...,
where each component is a pair of the cluster address of an image (obtained by SOM)
and its observation time. Then we define the event $e_i$ in the sequence as
continuously occurring clusters, which is expressed by

$$e_i = \langle C_i, S_i^f, TS_i, TE_i \rangle \quad (i = 1, \ldots, n), \qquad (6)$$

where $C_i$ is the cluster address, $S_i^f$ is the continuity, $TS_i$ is the starting time,
and $TE_i$ is the ending time. The sequence S is then represented by

$$S = \langle e_1, e_2, \ldots, e_n \rangle, \qquad (7)$$

where n is the total number of events in the sequence. Figure 4 shows a
representation of an event sequence in the case of $S^f > 2$.
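The conversion from a sequence of (cluster address, time) pairs into events can be sketched as a run-length encoding. This is an illustrative reading of the event definition, not the authors' code: the `Event` type, the function name, and the `min_continuity` parameter (events are kept when their continuity is at least this value) are assumptions made for the example.

```python
from typing import NamedTuple


class Event(NamedTuple):
    cluster: str      # cluster address C_i
    continuity: int   # number of consecutive observations in the run
    start: int        # starting time TS_i
    end: int          # ending time TE_i


def to_events(observations, min_continuity=1):
    """Group consecutive observations with the same cluster address into events."""
    events = []
    for cluster, time in observations:
        if events and events[-1].cluster == cluster:
            last = events[-1]
            events[-1] = Event(cluster, last.continuity + 1, last.start, time)
        else:
            events.append(Event(cluster, 1, time, time))
    # Short runs are dropped after grouping; neighbouring events are not re-merged.
    return [e for e in events if e.continuity >= min_continuity]


observations = [("A", 1), ("A", 2), ("C", 3), ("B", 4), ("B", 5), ("B", 6)]
all_events = to_events(observations)
long_events = to_events(observations, min_continuity=2)
```

With the toy sequence above, the single observation of cluster C forms an event of continuity 1 and disappears once a minimum continuity of 2 is requested.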

We extract the event pairs that occur close together in the sequence by introducing a
local time window. Assuming a local time window of length neighbor,
a simple local variation pattern E is represented by

$$E = ([e_i, e_j], \mathit{neighbor}) \quad (i \in \{1, \ldots, n-1\},\ j \in \{2, \ldots, n\}), \qquad (8)$$

where $[e_i, e_j]$ is a combination of two events $e_i$ and $e_j$ that satisfies $i < j$
and $TS_j - TE_i < \mathit{neighbor}$.

Although neighbor is an idea similar to the time window in episode rules,
we use this concept to extract only serial episodes such as A => B,
excluding parallel episode rules and combinations of serial/parallel episode rules,
which are also covered by episode rules.

Furthermore, we use the method of co-occurring term pairs from document analysis
to evaluate the strength of correlation of event pairs that occur in the
local time window and to extract the prominent pairs as rules. The cohesion of
the events $e_i$ and $e_j$ in a local time window is defined by

$$\mathrm{cohesion}(e_i, e_j) = \frac{E_f(e_i, e_j)}{\sqrt{f(e_i) \times f(e_j)}}, \qquad (9)$$

where $f(e_i)$ and $f(e_j)$ are the frequencies of $e_i$ and $e_j$, respectively, and
$E_f(e_i, e_j)$ is the frequency of the co-occurrence of both $e_i$ and $e_j$ in a local
time window. A time-dependent association rule is extracted when the event pair has
a cohesion larger than the threshold.

The procedure of extraction of time-dependent association rules in each local 
time window with the length of neighbor is described in the following: 

step 1 The frequency of each event /(e) in a local time window is deter- 
mined. 

step 2 A combinational set of event pairs in a local time window are listed 
as rule candidates. 

step 3 Candidate pairs are sorted lexicographically in regard to the first 
event and then the following event. 

step 4 Identical event pairs are merged, the co-occurrence frequency of each
candidate pair Ef(ei, ej) is counted, and the cohesion is calculated.
step 5 The event pairs that have larger cohesions than the threshold are 
extracted as rules. 

It should be noted that extraction is performed for each local time window by 
sliding its position. 

Strongly correlated event pairs have large cohesions even if each event occurs 
less frequently. Inversely, weakly correlated event pairs have small cohesions even 
if each event occurs very frequently. 
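Steps 1 to 5 can be sketched as follows. The sketch makes two simplifying assumptions that are not stated in the text: events are identified by their cluster address only, and the frequencies f(e) are counted once over the whole event list rather than separately for every position of the sliding window. The function name and the toy events are hypothetical.

```python
import math
from collections import Counter


def extract_rules(events, neighbor, threshold):
    """Return {(first, second): cohesion} for serial pairs above the threshold.

    `events` is a time-ordered list of (cluster, start, end) tuples.
    """
    freq = Counter(cluster for cluster, _, _ in events)            # step 1
    pair_freq = Counter()
    for i, (c_i, _, te_i) in enumerate(events):                    # step 2
        for c_j, ts_j, _ in events[i + 1:]:
            if ts_j - te_i >= neighbor:
                break  # events are time-ordered, so later ones are even farther away
            pair_freq[(c_i, c_j)] += 1                             # step 4 (counting)
    rules = {}
    for (c_i, c_j), co_freq in sorted(pair_freq.items()):          # step 3
        cohesion = co_freq / math.sqrt(freq[c_i] * freq[c_j])      # cohesion definition
        if cohesion > threshold:                                   # step 5
            rules[(c_i, c_j)] = cohesion
    return rules


events = [("A", 1, 2), ("B", 3, 4), ("A", 5, 6), ("B", 7, 8), ("C", 20, 21)]
strong = extract_rules(events, neighbor=3, threshold=0.6)
weak = extract_rules(events, neighbor=3, threshold=0.4)
```

In the toy sequence, A is followed closely by B twice while each of them occurs only twice, which gives the pair (A, B) a cohesion of 1.0; the pair (B, A) co-occurs once and reaches 0.5.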



4.2 Result of Experiments Regarding Time-Dependent Association 
Rules 

We applied the time-dependent association rule extraction described above to
the sequences of cluster addresses obtained in Sect. 3.4. Here we take a cohesion
threshold of 0.4 and neighbor ranging from 10 to 50. Since we sampled images
every 8 hours, the corresponding length of neighbor is between 3.3 days and 16.7 days.

Table 3 shows the relationship between the size of neighbor and the number
of extracted rules. Although the assessment of the contents of the extracted rules
and the development of the user interface are ongoing issues, the result shows that
similar numbers of rules are extracted from the data sets of different years, which
suggests that our present method is useful and robust.
Table 3. Relationship between neighbor and number of rules.

neighbor                 10   20   30   40   50
number of rules (1997)   17   63  116  165  207
number of rules (1998)    7   50   98  166  218



5 Integration of Extracted Rules and the Relational 
Database 

We integrate image sequences and the extracted rules with the relational
database and construct a system that supports analysis and discovery by domain
experts. Here we index the time sequences by using an R-tree to enable fast query
operations.

5.1 Indexing by R-Tree 

As shown in Fig. 5, there are natural hierarchical enclosure relations between
time sequences such as year, season, month, rule, and event. By using the
R-tree method, we can express these time sequences by minimum bounding
rectangles (defined by the starting time and the ending time of the sequence) and
store them in a hierarchical tree that reflects the enclosure relations. This
enables fast queries for weather patterns by using months or seasons as
the search key.
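The enclosure idea can be illustrated with a deliberately simplified, one-dimensional stand-in for an R-tree. The sketch below is not a real R-tree (it has no node splitting or balancing), and the class name, the helper `enclose`, and the toy intervals are hypothetical; it only shows how bounding intervals that enclose their children allow whole subtrees to be skipped during a search.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    label: str
    start: int                       # starting time of the bounding interval
    end: int                         # ending time of the bounding interval
    children: list = field(default_factory=list)

    def search(self, lo, hi):
        """Yield labels of leaves overlapping [lo, hi], pruning non-overlapping subtrees."""
        if self.end < lo or self.start > hi:
            return
        if not self.children:
            yield self.label
        for child in self.children:
            yield from child.search(lo, hi)


def enclose(label, children):
    """Build a parent whose interval is the minimum bound of its children."""
    return Node(label,
                min(c.start for c in children),
                max(c.end for c in children),
                list(children))


# Toy hierarchy: rules (leaves) inside months inside a season.
rule_1 = Node("A=>B", 10, 20)
rule_2 = Node("B=>C", 40, 55)
rule_3 = Node("A=>M", 100, 130)
july = enclose("July", [rule_1, rule_2])
august = enclose("August", [rule_3])
summer = enclose("summer", [july, august])
```

A query such as `list(summer.search(15, 45))` descends only into "July", because the bounding interval of "August" does not overlap the requested range.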

5.2 Definition of Attributes 

We stored the extracted patterns in the following three tables: “series (l_term,
r_term, cohesion, location, first, last)”¹, “date_id (id, date)”, and “e_series (term,
first, last)”², which represent the contents of time-dependent rules, the relationship
between image ID and observation time, and the contents of time-dependent rule
components (events), respectively.



5.3 Query by SQL 

Storing extracted patterns in the database enables the secondary retrieval of 
the various complex patterns by using SQL statements. We show an example of 
complex queries and the corresponding SQL statement in the following: 

¹ “l_term” and “r_term” are the cluster addresses of the first event and the second
event of extracted rules, respectively, “location” is the reference to the R-tree
rectangles, and “first” and “last” are the image IDs of the “l_term” starting point
and the “r_term” ending point, respectively.

² “term” is the cluster number of the event, and “first” and “last” are the image IDs
of the starting point and the ending point of “term”, respectively.

Fig. 5. Indexing by using R-tree, focusing on the continuing sequence. Arrows at the
bottom represent minimum bounding boxes of rules.
bottom represent minimum bounding boxes of rules. 



“Search for a weather change in 1997 such that a typhoon occurred within
10 days after a front and a successive high pressure mass developed within
the time interval of 5 days.”³

    select t1.first, t2.last, t3.date, t4.date
    from series t1, e_series t2, date_id t3, date_id t4
    where t1.l_term in ('A', 'F', 'I', 'J', 'D')
      and t1.r_term in ('B', 'C', 'D', 'H', 'I', 'M', 'N')
      and t2.term in ('A', 'J', 'M', 'N')
      and (t1.last - t1.first < 15)
      and (t2.last - t1.first < 30)
      and (t1.last <= t2.last)
      and t3.id = t1.first and t4.id = t2.last
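To try the query without the original database, the three tables can be recreated in an in-memory SQLite database with Python's standard `sqlite3` module. Everything below except the table and column names is an assumption for illustration: the column types, the toy rows, and the `image-<id>` date strings are made up, SQLite stands in for the object-relational database used in the study, and the column names first and last are written in double quotes so that they are treated as plain identifiers.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE series   (l_term TEXT, r_term TEXT, cohesion REAL,
                       location INTEGER, "first" INTEGER, "last" INTEGER);
CREATE TABLE e_series (term TEXT, "first" INTEGER, "last" INTEGER);
CREATE TABLE date_id  (id INTEGER PRIMARY KEY, date TEXT);
""")
conn.executemany("INSERT INTO series VALUES (?, ?, ?, ?, ?, ?)", [
    ("A", "B", 0.52, None, 10, 20),   # front followed by high pressure
    ("G", "B", 0.61, None, 12, 18),   # first event is not a front cluster
])
conn.executemany("INSERT INTO e_series VALUES (?, ?, ?)", [
    ("M", 25, 30),   # typhoon shortly after the rule
    ("M", 60, 70),   # typhoon, but too late
    ("E", 22, 26),   # not a typhoon cluster
])
conn.executemany("INSERT INTO date_id VALUES (?, ?)",
                 [(i, f"image-{i}") for i in range(80)])

QUERY = """
SELECT t1."first", t2."last", t3.date, t4.date
FROM series t1, e_series t2, date_id t3, date_id t4
WHERE t1.l_term IN ('A', 'F', 'I', 'J', 'D')
  AND t1.r_term IN ('B', 'C', 'D', 'H', 'I', 'M', 'N')
  AND t2.term IN ('A', 'J', 'M', 'N')
  AND t1."last" - t1."first" < 15
  AND t2."last" - t1."first" < 30
  AND t1."last" <= t2."last"
  AND t3.id = t1."first" AND t4.id = t2."last"
"""
rows = conn.execute(QUERY).fetchall()
```

Only the first rule combined with the first typhoon event satisfies all conditions, so `rows` contains a single tuple with the two image IDs and their date strings.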



5.4 Result of Implementation and Issues in the Future Work 

Figure 6 shows an example of the user interface of the integrated KDD support
system for weather information. Here we can retrieve weather patterns by using
the season, the first event, and the second event as the search keys. Matched
sequences are listed in the lower left frame, and when one is selected from the list,
the corresponding weather variation is shown as an animation in the lower right
frame. To deal with much more complex queries, we also provide a user
interface that accepts SQL queries directly.

In this system, however, users cannot control the primary knowledge extraction
process by changing parameters such as the size of the SOM, the cohesion
threshold, and the size of the local time window. There are two approaches to
solving this problem: one is to incorporate an optimization process for these
parameters⁴, and the other is to improve the interactivity of the user interface.
Examining both approaches will be one of the most significant issues in
future work. We also consider it important to design a user interface that
stimulates the expert’s natural discovery process. Furthermore, improving the
time-sequential pattern analysis beyond simple rule extraction will be important for
handling temporal patterns that are more meaningful to domain experts, including
prediction.

³ Note that time interval 1 in this SQL statement corresponds to 8 hours, and the
capital letters indicate the cluster addresses described in Table 1.

Fig. 6. Example of the result of retrieval from sequential image data.

6 Conclusion 

We applied clustering and time-dependent association rules to a large-scale 
content-based image database of weather satellite images. Each image is au- 
tomatically classified by two-stage SOM. We also extracted unknown rules from 
time-sequential data expressed by a sequence of cluster addresses by using time- 
dependent association rules. Furthermore, we developed a knowledge discovery
support system for domain experts, which retrieves image sequences using
extracted events and association rules. From the perspective that high-level queries
make the analysis easier, we stored the extracted rules in the database to admit
sophisticated queries written in SQL. The retrieval responses to various
queries show the usefulness of this approach.

The framework presented in this study, clustering => transformation into
time-sequential data => extraction of time-dependent association rules, is
considered to be useful also for managing enormous multimedia data sets that include
sequential patterns, such as video and audio information or the results of numerical
simulations.

⁴ For example, the growing hierarchical SOM algorithm, which can grow both in
map size and in a three-dimensional tree structure, should be
effective for adapting the map size. We would like to examine this algorithm
in future work.
Abstract. The World-Wide Web contains a wealth of semistructured
information sources that often give partial/overlapping views on the same 
domains, such as real estate listings or book prices. These partial sources 
could be used more effectively if integrated into a single view; however, 
since they are typically formatted in diverse ways for human viewing, ex- 
tracting their data for integration is a difficult challenge. Existing learn- 
ing systems for this task generally use hardcoded ad hoc heuristics, are 
restricted in the domains and structures they can recognize, and/or re- 
quire manual training. We describe a principled method for automati- 
cally generating extraction wrappers using grammatical inference that 
can recognize general structures and does not rely on manually-labelled 
examples. Domain-specific knowledge is explicitly separated out in the 
form of declarative rules. The method is demonstrated in a test setting 
by extracting real estate listings from web pages and integrating them 
into an interactive data visualization tool based on dynamic queries. 



1 Introduction 

The World-Wide Web contains a wealth of information resources, many of which 
can be considered as semistructured data sources: that is, sources containing
data that is fielded but not constrained by a global schema. For example, doc- 
uments such as product catalogs, staff directories, and classified advertisement 
listings fall into this category. Often, multiple sources provide partial or over- 
lapping views on the same underlying domain. As a result, there has been much 
interest in trying to combine and cross-reference disparate data sources into a 
single integrated view. 

Parsing web pages for information extraction is a significant obstacle, how- 
ever. Although the markup formatting of web sources provides some hints about 
their record and field structure, this structure is also obscured by the presenta- 
tion aspects of formatting intended for human viewing and the wide variation in 
formats from site to site. Manually constructing extraction wrappers is tedious 
and time-consuming, because of the large number of sites to be covered and the 
need to keep up-to-date with frequent formatting changes. 
We propose the use of grammatical inference to automate the construction of 
wrappers and facilitate the process of information extraction. Grammatical in- 
ference is a subfield of machine learning concerned with inferring formal descrip- 
tions of sets from examples. One application is the inference of formal grammars 
as generalized structural descriptions for documents. 

By applying an inference algorithm to a training sample of web pages from 
a given site, we can learn a grammar describing their format structure. Using 
domain-specific knowledge encoded in declarative rules, we can identify produc- 
tions corresponding to records and fields. The grammar can then be compiled 
into a wrapper which extracts data from those pages. Since the data pages on 
a given website typically follow a common site format, particularly if they are 
dynamically created from scripts, such wrappers should be able to operate on 
the rest of the pages as well. 

This process can be largely automated, making it easy to re-generate wrap- 
pers when sites change formatting. Although others have explored the use of 
machine learning for wrapper creation, previous systems have generally relied 
on hardcoded ad hoc heuristics or the manual labelling of examples, and/or have 
been restricted in the domains and structures they can recognize. Our method is 
more principled, based on an objective method for inferring structure and a de- 
scription language (context-free grammars) of high expressive power. We avoid 
manual intervention as far as possible and explicitly separate out domain-specific 
knowledge using declarative rules. These characteristics should make our system 
easier to use and more broadly applicable in different domains. 

Taking the real estate domain as an example, we demonstrate the use of our 
approach to extract property listings from a set of mock web pages. We also 
briefly show an interactive data visualization tool based on dynamic queries for 
exploring the resulting high-dimensional data space. 

The rest of this paper is organized as follows: in Sect. 2 we introduce the 
formal background to grammatical inference before describing our inference 
algorithm in Sect. 3. We then apply the algorithm to the real estate domain in 
Sect. 4. Related work is discussed in Sect. 5, and finally Sect. 6 gives conclusions 
and future work. 

2 Grammatical Inference 

Grammatical inference is a class of inductive inference in which the target 
is a formal language (a set of strings over some alphabet $\Sigma$) and the hypothesis 
space is some family of grammars. The objective is to infer a consistent grammar 
for the unknown target language, given a finite set of examples. 

The classical approach to grammatical inference was first given by Gold, 
who introduced the notion of identification in the limit. This notion is concerned 
with the limiting behavior of an inference algorithm on an infinite sequence 
of examples. Formally, a complete presentation of a language $L$ is an infinite 
sequence of ordered pairs $(w, l)$ in $\Sigma^* \times \{0, 1\}$, where $l = 1$ if $w \in L$ and 0 
otherwise, and every string $w \in \Sigma^*$ appears at least once. If an inference method 
$M$ is run on larger and larger initial segments of a complete presentation, it will 
generate an infinite sequence of guesses $g_1, g_2, g_3$, etc. $M$ is said to identify $L$ 
in the limit if there exists some number $n$ such that all of the guesses $g_i$ are the 
same for $i > n$ and are equivalent to $L$. 

This approach is not directly applicable to the web document task, since only 
positive examples are available (these being the actual documents existing at a 
site). Gold showed that any class of languages containing all the finite languages 
and at least one infinite language cannot be identified in the limit from only 
positive examples without negative ones. For example, the classes of regular and 
context-free languages both fit this criterion. The problem is that the task is 
under-constrained. Given only positive examples, the inferencer has no basis for 
choosing among hypotheses which are too general (e.g. the language consisting 
of all strings), too specific (e.g. the language consisting of exactly the examples 
seen so far), or somewhere in between. 



3 Inference Algorithm 

We approach the inference problem differently as a search for the simplest gram- 
mar which has a consistent fit with the provided sample, on the assumption that 
simple grammars are more likely to convey meaningful structure. We introduce 
a learning bias to constrain the search by starting from a specialized grammar 
which has high fit but low simplicity, and applying various transformations to 
generalize and simplify it while retaining fit. To guide this process, we take 
the set of stochastic context-free grammars as the hypothesis space and define 
a complexity function on it in terms of description length. Stochastic 
context-free grammars are context-free grammars with probabilities attached to their 
productions. The probabilities aid inference by providing additional information 
about the relative weight of alternative productions — for example, given two al- 
ternatives for some nonterminal, are both equally important or is one likely to 
be just noise? This information is useful for assessing relative complexity and 
performing simplifications or extracting data later on. 



3.1 Measuring Grammar Complexity 

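Let G be a stochastic context-free grammar with productions and associated 
probabilities given by: 

$$
\begin{aligned}
X_1 &\to w_{11} \mid w_{12} \mid \cdots \mid w_{1,m_1} &\quad& [p_{11}, p_{12}, \ldots, p_{1,m_1}] \\
X_2 &\to w_{21} \mid w_{22} \mid \cdots \mid w_{2,m_2} &\quad& [p_{21}, p_{22}, \ldots, p_{2,m_2}] \\
&\;\;\vdots \\
X_n &\to w_{n1} \mid w_{n2} \mid \cdots \mid w_{n,m_n} &\quad& [p_{n1}, p_{n2}, \ldots, p_{n,m_n}]
\end{aligned}
\tag{1}
$$

where the $X_i$ are nonterminals, the $w_{ij}$ are alternatives, and the $p_{ij}$ are the 
probabilities associated to those alternatives (i.e. $\sum_j p_{ij} = 1$ for each $i$). 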
Following Cook et al., we define the complexity $C(G)$ as: 

$$
C(G) = \sum_{i=1}^{n} \sum_{j=1}^{m_i} \bigl( -\log p_{ij} + c(w_{ij}) \bigr)
\tag{2}
$$

where $c(w_{ij})$ is a second complexity function on the $w_{ij}$ strings. This definition 
has the intuitively desirable properties that the complexity of G is the sum of the 
complexities of its productions and that the complexity of a production is the 
sum of the complexities of its alternatives. The complexity of an alternative has 
two components, the information-theoretic information content of its probability 
and the complexity of the string produced. Finally, the complexity of a string $w$ 
is a function of its length and the proportions of distinct symbols in it: 

$$
c(w) = (K+1)\log(K+1) - \sum_{i=1}^{r} k_i \log k_i
\tag{3}
$$

where $w$ has length $K$ and contains $r$ distinct symbols occurring $k_1, k_2, \ldots, k_r$ 
times, respectively. Longer and more varied strings are rated more complex. 
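As a small worked instance of Eq. (3), consider the strings () and ()(). The base of the logarithm is not stated in the text; base 2 is assumed here because it reproduces the complexity values quoted in Sect. 3.3. The first string has K = 2 with two symbols occurring once each, the second has K = 4 with two symbols occurring twice each:

```latex
\begin{aligned}
c\bigl(()\bigr)   &= 3\log_2 3 - 1\log_2 1 - 1\log_2 1 \approx 4.75, \\
c\bigl(()()\bigr) &= 5\log_2 5 - 2\log_2 2 - 2\log_2 2 \approx 7.61 .
\end{aligned}
```

The longer string receives the higher complexity, in line with the remark above.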

3.2 Inference as Search 

We can formulate the goal of looking for the simplest consistent grammar as 
a search in the space of grammars where the cost function is the complexity 
function $C$. Our starting point is the overspecific grammar that simply generates 
the training set with perfect fit: 

$$
S \to w_1 \mid w_2 \mid \cdots \mid w_m \qquad [p_1, p_2, \ldots, p_m],
\tag{4}
$$

where $w_1, \ldots, w_m$ are the strings occurring in the set and $p_1, \ldots, p_m$ are their 
relative frequencies. If all the strings are different, then the $p_i$ will all be equal 
to $1/m$; however, the $p_i$ may vary if some strings appear more than once in the 
set. This initial grammar will generally have very high complexity. 
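The construction of the starting grammar in Eq. (4) can be sketched in a few lines. This is only an illustration: the function name `initial_grammar` and the representation of a production as a list of (alternative, probability) pairs are assumptions made for the example, not part of the method as described.

```python
from collections import Counter


def initial_grammar(sample):
    """Build the overspecific grammar S -> w1 | ... | wm of Eq. (4).

    Returns a list of (alternative, probability) pairs in which each
    probability is the relative frequency of that string in the sample.
    """
    counts = Counter(sample)
    total = sum(counts.values())
    return [(w, n / total) for w, n in counts.items()]


# A string that occurs twice receives twice the weight of the others.
print(initial_grammar(["ab", "ab", "cd", "ef"]))
# [('ab', 0.5), ('cd', 0.25), ('ef', 0.25)]
```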

We can then perform a search by considering various transformation steps 
which might lower the complexity and generalize the grammar while retaining 
good fit. Some of the transformations used (again following Cook et al.) are: 

1. Substitution: If a substring s occurs multiple times in different alternatives 
(e.g. in the grammar $X_1 \to asb$, $X_2 \to csd$), create a new rule $Y \to s$ 
and replace all occurrences of s by Y’s. This transformation helps identify 
subunits of structure. For example, when applied to the productions “John 
is eating cake” and “Mary is eating bread,” it will separate “is eating” into 
another rule. 

2. Disjunction: If two substrings s and t occur in similar contexts (e.g. in the 
grammar $X_1 \to asb$, $X_2 \to atb$), create a new rule $Y \to s \mid t$ and replace 
all occurrences of s and t by Y’s. This transformation introduces generalization 
based on context. For example, when applied to the productions “John 
throws baseballs” and “John catches baseballs,” it will propose “throws” 
and “catches” as alternatives for the same production. 

3. Expansion: Remove a rule $Y \to s \mid t \mid \cdots \mid v$ by replacing every alternative 
that mentions Y with a set of alternatives in which Y is replaced with s, t, 
etc. This can reverse previous substitutions and disjunctions later on. 

4. Truncation: Remove alternatives having very low probability and redistribute 
their probability among the remaining alternatives. This can be used to 
remove noise below some threshold. 

5. Normalization: Merge redundant alternatives (e.g. $X \to s \mid s$) and drop 
productions that are inaccessible (cannot be reached from the start symbol) or 
blocking (result in some nonterminal that cannot be rewritten). This is often 
necessary to “clean up” grammars to show the full extent of simplification 
resulting from another transformation. 
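To make one of the transformations listed above concrete, the following is a minimal sketch of truncation for a single production. It assumes that the removed probability mass is redistributed proportionally (by renormalizing); the text does not specify the redistribution scheme, and the function name and data layout are hypothetical.

```python
def truncate(alternatives, threshold):
    """Drop alternatives whose probability is below `threshold`.

    `alternatives` is a list of (string, probability) pairs for one
    nonterminal. The surviving probabilities are rescaled to sum to 1,
    which is one possible way to redistribute the removed mass.
    """
    kept = [(w, p) for w, p in alternatives if p >= threshold]
    if not kept:
        raise ValueError("threshold removes every alternative")
    total = sum(p for _, p in kept)
    return [(w, p / total) for w, p in kept]


# The rare third alternative is treated as noise and removed.
rule = [("a b", 0.6), ("a c", 0.38), ("a x c", 0.02)]
print(truncate(rule, 0.05))
```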

Other variations on these transformations are also considered. For practical rea- 
sons, since the branching factor of possible search steps can be very large (some- 
times exceeding 100), we perform searching using a greedy deterministic hill- 
climbing strategy. Simulated annealing is another possibility we are examining. 
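The greedy deterministic hill-climbing strategy mentioned above can be sketched generically as follows. In the setting of this paper, a state would be a grammar, `neighbours` would enumerate the grammars reachable by one transformation, and `cost` would be the complexity function; those roles, and all names in the sketch, are assumptions for illustration. The toy usage minimizes a simple function over the integers instead.

```python
def hill_climb(start, neighbours, cost):
    """Greedy deterministic descent.

    Repeatedly move to the cheapest neighbour and stop as soon as no
    neighbour is strictly cheaper than the current state.
    """
    current, current_cost = start, cost(start)
    while True:
        candidates = list(neighbours(current))
        if not candidates:
            return current
        best = min(candidates, key=cost)
        best_cost = cost(best)
        if best_cost >= current_cost:
            return current  # local optimum reached
        current, current_cost = best, best_cost


# Toy usage: minimise (x - 7)^2 with steps of +1 or -1.
result = hill_climb(0, lambda x: [x - 1, x + 1], lambda x: (x - 7) ** 2)
print(result)  # 7
```

Because the search never accepts a worse state, it can stop in a local optimum, which is the motivation for considering simulated annealing.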



3.3 Example: Parenthesis Expressions 

To demonstrate the algorithm, we consider the language of balanced parenthesis 
strings. Take the set of all such strings up to length 6 as the training set, with 
frequencies as shown (to be justified later): 

$$
\begin{aligned}
\text{sample} = {} & \{\, (),\ ()(),\ (()),\ ()()(),\ ()(()),\ (())(),\ (()()),\ ((())) \,\} \\
& [0.5, 0.125, 0.125, 0.0625, 0.03125, 0.03125, 0.0625, 0.0625].
\end{aligned}
\tag{5}
$$

The initial grammar, with complexity 99.68, is: 

$$
\begin{aligned}
S \to {} & () \mid ()() \mid (()) \mid ()()() \mid ()(()) \mid (())() \mid (()()) \mid ((())) \\
& [0.5, 0.125, 0.125, 0.0625, 0.03125, 0.03125, 0.0625, 0.0625].
\end{aligned}
\tag{6}
$$

Ten substrings are candidates for substitution: (), )(, ((, )), ()(, )(), ((), ()), ()(), 
and (()). The greatest reduction in complexity is obtained by substituting on (). 
Since it already appears as an alternative for S, we simply substitute S for () 
everywhere. The resulting grammar has complexity 88.09: 

$$
\begin{aligned}
S \to {} & () \mid SS \mid (S) \mid SSS \mid S(S) \mid (S)S \mid (SS) \mid ((S)) \\
& [0.5, 0.125, 0.125, 0.0625, 0.03125, 0.03125, 0.0625, 0.0625].
\end{aligned}
\tag{7}
$$

Now the repeated substrings are SS, (S, S), and (S). Choosing (S) gives: 

$$
\begin{aligned}
S \to {} & () \mid SS \mid (S) \mid SSS \mid SS \mid SS \mid (SS) \mid (S) \\
& [0.5, 0.125, 0.125, 0.0625, 0.03125, 0.03125, 0.0625, 0.0625].
\end{aligned}
\tag{8}
$$

After normalizing by merging redundant alternatives and summing their associated 
probabilities, we obtain a grammar with complexity 42.19: 

$$
S \to () \mid SS \mid (S) \mid SSS \mid (SS) \qquad [0.5, 0.1875, 0.1875, 0.0625, 0.0625].
\tag{9}
$$

The final set of repeated substrings are SS, (S, and S), of which SS lowers 
complexity the most. This gives: 

$$
S \to () \mid SS \mid (S) \mid SS \mid (S) \qquad [0.5, 0.1875, 0.1875, 0.0625, 0.0625],
\tag{10}
$$

which after normalizing is the usual grammar for the parenthesis language: 

$$
S \to () \mid SS \mid (S) \qquad [0.5, 0.25, 0.25].
\tag{11}
$$

The final complexity is 20.51. Notice that if these probabilities are used to 
generate a set of strings up to length 6, we recover the string frequencies in the 
original sample. In this example, only substitution and normalization operations 
were used, but in general other transformations may be needed as well. 
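The complexity values quoted in this example can be re-derived with a short script. The sketch below makes several assumptions that are not stated in the text: logarithms are taken to base 2, every character (parenthesis or S) counts as one symbol, and the substring to substitute at each step is supplied by hand in the order given above rather than found by search. All function names are hypothetical.

```python
from collections import Counter
from math import log2


def string_complexity(w):
    """Eq. (3): (K+1) log(K+1) - sum_i k_i log k_i, using base-2 logs."""
    k = len(w)
    return (k + 1) * log2(k + 1) - sum(n * log2(n) for n in Counter(w).values())


def grammar_complexity(alts):
    """Eq. (2) for a one-nonterminal grammar given as (string, prob) pairs."""
    return sum(-log2(p) + string_complexity(w) for w, p in alts)


def substitute(alts, s, nt="S"):
    """Replace occurrences of s by nt, keeping the alternative that defines s."""
    return [(w if w == s else w.replace(s, nt), p) for w, p in alts]


def normalize(alts):
    """Merge identical alternatives and sum their probabilities."""
    merged = {}
    for w, p in alts:
        merged[w] = merged.get(w, 0.0) + p
    return list(merged.items())


grammar = list(zip(
    ["()", "()()", "(())", "()()()", "()(())", "(())()", "(()())", "((()))"],
    [0.5, 0.125, 0.125, 0.0625, 0.03125, 0.03125, 0.0625, 0.0625],
))
steps = [round(grammar_complexity(grammar), 2)]

grammar = substitute(grammar, "()")           # first substitution, nothing to merge yet
steps.append(round(grammar_complexity(grammar), 2))

for s in ["(S)", "SS"]:                       # each followed by normalization
    grammar = normalize(substitute(grammar, s))
    steps.append(round(grammar_complexity(grammar), 2))

print(steps)    # [99.68, 88.09, 42.19, 20.51]
print(grammar)  # [('()', 0.5), ('SS', 0.25), ('(S)', 0.25)]
```

Under these assumptions the script reproduces the four reported complexities and ends with the grammar of Eq. (11).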

4 Information Extraction 

We tested our algorithm on a set of mock web pages containing London real 
estate listings. The pages all followed the same general layout (see Figs. 1 and 2) 
but contained varying numbers of listings on each page, some containing pictures 
of the described property and some without. The set of pages was taken as the 
training set and the algorithm attempted to construct a suitable description 
from which extraction wrappers could be generated. 

4.1 Grammatical Inference Phase 

The web pages in the training set were first converted to abstract strings over 
the alphabet {HTML tag types} $\cup$ {text}. This was done by discarding HTML 
attributes from the tags encountered, so that tags of the same type (e.g. anchor 
start tags) would be treated as the same alphabet symbol (e.g. a). Free 
text occurring between tags was converted to the symbol text. In using this 
transformation, we assume that structure is mainly present at the tag type level 
and focus on that level by ignoring variations in text and attributes (e.g. href 
values). For example, two contact links: 

<hr><a href="mailto:sales@a.com">A-1 Realtors</a> 

<hr><a href="mailto:help@b.com">Bee Estate Agents</a> 

would both be transformed to the same abstract string, hr a text /a. 
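The abstraction step just described can be sketched with Python's standard `html.parser` module. This is an illustrative reimplementation under assumptions, not the code used for the experiments; the class name `TagAbstractor` and the helper `abstract` are hypothetical, and adjacent text runs are collapsed into a single text symbol as a design choice.

```python
from html.parser import HTMLParser


class TagAbstractor(HTMLParser):
    """Collect tag-type symbols, discarding attributes and literal text."""

    def __init__(self):
        super().__init__()
        self.symbols = []

    def handle_starttag(self, tag, attrs):
        self.symbols.append(tag)        # attributes such as href are dropped

    def handle_endtag(self, tag):
        self.symbols.append("/" + tag)

    def handle_data(self, data):
        # Any non-blank run of free text becomes the single symbol "text".
        if data.strip() and (not self.symbols or self.symbols[-1] != "text"):
            self.symbols.append("text")


def abstract(html):
    """Return the abstract string for an HTML fragment."""
    parser = TagAbstractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.symbols)


print(abstract('<hr><a href="mailto:sales@a.com">A-1 Realtors</a>'))
# hr a text /a
```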

Each page string then became an alternative in the initial grammar, which 
had perfect fit but high complexity (1056.32): 

$$
\begin{aligned}
S \to {} & \texttt{html head} \,\ldots\, \texttt{table tr td b text /b br text} \,\ldots\, \texttt{/html} \\
\mid {} & \texttt{html head} \,\ldots\, \texttt{table tr td img b text /b br} \,\ldots\, \texttt{/html} \\
\mid {} & \text{etc.}
\end{aligned}
\tag{12}
$$

Southwest London 

New Kings Road, SW6 
£120,000 
A quiet and secluded studio flat with a garden situated at the rear of this Victorian converted building in 
the heart of Fulham. Contact: 020-722-3322. 

Addison Gardens, SW14 
£124,000 
A particularly quiet and conveniently located studio flat with the great benefit of direct access to 
wonderful ‘hidden’ communal gardens. Contact: 020-7431-1020. 

Cheval Court, SW15 
£120,000 
A refurbished first floor studio apartment with southerly views over gardens in a purpose built block with 
off street parking. Contact: 020-8879-7922. 

Jeffreys Road, SW4 
£130,000 
A good sized lower ground floor flat which would be ideal for a first time buyer. The flat has a spacious 
sitting room which further benefits from having direct access to its own west facing patio. Jeffreys Road 
runs between Larkhall Lane and Clapham Road. The closest tube station is Stockwell with transport links 
to Victoria. Patio. 1 reception, 1 bedroom, 1 bathroom. Leasehold. Contact: 0800-919-308. 

April 2001 

Fig. 1. A sample real estate listing page 



<html><head> 
<title>Listings in Southwest London</title> 
</head> 
<body> 
<h1>Southwest London</h1> 
<table border=1 width=100%> 
<tr><td><b>New Kings Road, SW6</b><br> 
&pound;120,000<br> 
A quiet and secluded studio flat with a garden situated at 
the rear of this Victorian converted building in the heart 
of Fulham. Contact: 020-722-3322. 
</td></tr> ... 

Fig. 2. Part of the HTML source for Fig. 1 



The inference algorithm ran in five seconds on a Pentium 233 and examined 
386 candidate grammars, lowering the complexity to a value of 119.19 (see Fig. 3 
for the quality curve). The final grammar was: 

$$
\begin{aligned}
S &\to \texttt{html head title text /title /head body h1 text /h1 table}\ T\ \texttt{/table p address text /address /body /html} \\
T &\to T\,T \mid \texttt{tr td}\ U\ \texttt{b text /b br text br text /td /tr} \\
U &\to \varepsilon \mid \texttt{img br}
\end{aligned}
\tag{13}
$$

Fig. 3. Quality curve for the real estate grammar search (horizontal axis: steps) 



We can interpret this structure as follows. The start symbol S represents 
a complete page. A page begins with a fixed header, followed by one or more 
occurrences of T, each of which represents a single listing. A listing consists 
of a table row of data optionally containing an image U. Finally, the page is 
terminated by a fixed trailer. 
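The interpretation above can be checked mechanically. Because this particular grammar happens to describe a regular language (T is one or more rows and U is an optional prefix), a regular expression over the abstract symbol string is enough for a sketch; a general context-free grammar would need a real parser. The names `ROW`, `PAGE`, and `count_listings` are hypothetical.

```python
import re

# One listing row: tr td U b text /b br text br text /td /tr,
# where U is either empty or "img br".
ROW = r"tr td (?:img br )?b text /b br text br text /td /tr"

# Fixed header, one or more rows, fixed trailer.
PAGE = re.compile(
    r"html head title text /title /head body h1 text /h1 table "
    rf"(?P<rows>(?:{ROW} )+)"
    r"/table p address text /address /body /html"
)


def count_listings(abstract_page):
    """Return the number of listing rows, or None if the page does not fit."""
    match = PAGE.fullmatch(abstract_page)
    if match is None:
        return None
    return len(re.findall(ROW, match.group("rows")))


header = "html head title text /title /head body h1 text /h1 table"
plain = "tr td b text /b br text br text /td /tr"
with_image = "tr td img br b text /b br text br text /td /tr"
trailer = "/table p address text /address /body /html"

page = " ".join([header, plain, with_image, plain, trailer])
print(count_listings(page))  # 3
```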

In this process we have not made use of the DTD (document type definition) 
specification which defines the HTML language. Since by definition the HTML 
DTD is a general structure describing all HTML documents, it is not useful as 
a close fit to any particular set of pages. However, taking account of constraints 
from the DTD during transformations (for example, to keep blocks of elements 
from being split) might be a useful refinement to the algorithm. Alternately, a 
different approach altogether might be to start from the general HTML DTD 
and specialize it to the training set, rather than starting from a specific grammar 
and generalizing. 



4.2 Domain-Specific Phase 

The grammatical inference phase performs a coarse segmentation of the page into 
units of varying sizes, corresponding to different nonterminals. To complete the 
segmentation, we need to apply domain-specific knowledge to determine which 
units correspond to records and to segment the fields within records. 

Domain knowledge is expressed as declarative information extraction rules 
for domain fields. Each rule consists of a field name and type plus a regular 
expression defining the context in which that field might appear. Rules are 
executed by applying the regular expression to a chunk of text. If a match is found, 
a specified portion of the matching text is extracted as the value of the field. 
Disjunctions can be used to define multiple contexts for a field. For example, in 
the real estate domain we might have a default set of rules such as the following: 

number price = (&pound; | &#163; | £) {[0-9]+} 
string telephone = {[0-9]+-[0-9]+-[0-9]+} 
boolean garden = garden | yard 

The first rule says that a price looks like some form of pound sign followed 
by a number. The part of the match delimited by braces is returned as the field 
value. The second declares a telephone number as a string of digits interspersed 
with dashes, all of which becomes the field value. The third defines a boolean 
attribute which is true if either of the strings “garden” or “yard” is present. 
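The three rules can be approximated with Python's `re` module, where a capture group plays the role of the braces. The sketch below is an interpretation rather than the rule engine itself. One deliberate deviation is labeled in the code: the price pattern is widened to accept thousands separators so that a value such as 120,000 is captured whole. The names `RULES` and `apply_rules` are hypothetical.

```python
import re

# (field name, field type, compiled pattern).
RULES = [
    # Assumption: commas are allowed inside the number, unlike the rule as printed.
    ("price", "number", re.compile(r"(?:&pound;|&#163;|£)\s*([0-9][0-9,]*)")),
    ("telephone", "string", re.compile(r"([0-9]+-[0-9]+-[0-9]+)")),
    ("garden", "boolean", re.compile(r"garden|yard")),
]


def apply_rules(chunk):
    """Apply every rule to one chunk of text and return the typed fields found."""
    record = {}
    for field, ftype, pattern in RULES:
        match = pattern.search(chunk)
        if ftype == "boolean":
            record[field] = match is not None
        elif match is not None:
            value = match.group(1)
            record[field] = int(value.replace(",", "")) if ftype == "number" else value
    return record


chunk = ("New Kings Road, SW6 £120,000 A quiet and secluded studio flat "
         "with a garden. Contact: 020-722-3322.")
print(apply_rules(chunk))
# {'price': 120000, 'telephone': '020-722-3322', 'garden': True}
```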

These rules are then applied to the units discovered by the grammatical infer- 
ence phase. More precisely, the following procedure is used. Parse the training 
pages according to the inferred grammar. For each occurrence of a nontermi- 
nal, collect all of the text appearing below it in the parse tree into an associated 
chunk. For the page shown in Fig. 1, the nonterminal S would be associated with 
one chunk containing all of the text on the page; T would be associated with 
four chunks, each containing the text of one listing; and U would be associated 
with four chunks containing no text. 

Now apply the information extraction rules to each chunk. If a chunk yields 
multiple matches for several rules, as S does, it probably contains more than 
one record. However, if few or no rules match, as in U, the chunk is proba- 
bly smaller than a record. The nonterminal matching the most rules without 
duplicate matches is assumed to correspond to a record — in this case, T. 
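One way to turn this selection heuristic into code is sketched below. The scoring scheme is an assumption: a nonterminal is disqualified if any of its chunks yields more than one match for a rule, and otherwise scores the average number of rules that match per chunk. The text describes the idea only informally, and the rule patterns, function names, and shortened listing texts are all illustrative.

```python
import re

RULES = {
    "price": re.compile(r"£[0-9,]+"),
    "telephone": re.compile(r"[0-9]+-[0-9]+-[0-9]+"),
    "garden": re.compile(r"garden|yard"),
}


def score(chunks):
    """Average number of rules matching exactly once per chunk.

    Returns -1.0 if any chunk contains duplicate matches for a rule,
    which suggests the chunk spans more than one record.
    """
    total = 0
    for chunk in chunks:
        counts = [len(p.findall(chunk)) for p in RULES.values()]
        if any(c > 1 for c in counts):
            return -1.0
        total += sum(1 for c in counts if c == 1)
    return total / len(chunks) if chunks else 0.0


def record_nonterminal(chunks_by_nonterminal):
    """Pick the nonterminal whose chunks look most like single records."""
    return max(chunks_by_nonterminal, key=lambda nt: score(chunks_by_nonterminal[nt]))


listings = [
    "Kings Rd £120,000 flat with a garden. Contact: 020-722-3322.",
    "Jeffreys Rd £130,000 ground floor flat. Contact: 0800-919-308.",
]
chunks = {"S": [" ".join(listings)], "T": listings, "U": ["", ""]}
print(record_nonterminal(chunks))  # T
```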

4.3 Wrapper Generation Phase 

Having identified the nonterminal T as corresponding to a listing record, we 
can now compile the grammar into a wrapper that extracts records from pages 
as chunks of text and applies domain rules to extract typed fields from those 
records. At this point, the user can manually add additional site-specific rules for 
fine-tuning. These rules may be optionally qualified by a piece number to restrict 
the match range to a particular piece (i.e. the section of text corresponding to a 
specific text symbol) within a chunk. For example, the following rules: 

string address = 1: {.*}, 
string description = 3 

specify that the address is the part of the first piece appearing before a comma, 
and that the description is the entire content of the third piece. 
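Piece-qualified rules of this kind might be applied as in the following sketch, which assumes that a record has already been split into its text pieces in document order. The helper `extract_piece` and its signature are hypothetical.

```python
import re


def extract_piece(pieces, number, pattern=None):
    """Apply a rule restricted to one piece of a record.

    `number` is the 1-based piece number. With no pattern, the entire
    piece is returned; otherwise the first capture group of the match.
    """
    piece = pieces[number - 1]
    if pattern is None:
        return piece
    match = re.search(pattern, piece)
    return match.group(1) if match else None


pieces = ["New Kings Road, SW6", "£120,000", "A quiet and secluded studio flat."]
address = extract_piece(pieces, 1, r"(.*),")   # text before the comma
description = extract_piece(pieces, 3)         # whole third piece
print(address)      # New Kings Road
print(description)  # A quiet and secluded studio flat.
```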

Running the wrapper on the sample page yields the records shown in Table 1. 
Once wrappers for all of the data sources to be used have been generated, the 
system can extract records from each and integrate the resulting data into a 
combined database. If a partial database is already available, it may be of use 
in helping to identify domain fields and formulate extraction rules for them. 
Table 1. Partial listing of extracted records 

Address          Price    Garden  Description 
New Kings Road   120,000  yes     A quiet and secluded studio flat... 
Addison Gardens  124,000  yes     A particularly quiet and convenient... 
Cheval Court     120,000  yes     A refurbished first floor studio... 
Jeffreys Road    130,000  no      A good sized lower ground floor... 



Note that complete integration may not be possible, by nature of the data’s 
origin in multiple collections of semistructured text. Some extraction rules may 
fail on some records, and not all fields may be present on all sites in the first 
place or even present in all records within a site. However, a mostly-complete 
overview can still be of significant value. 

As a final step, this database can then be used as input into a data mining 
or information integration system. One example is an interactive real estate 
visualization system based on dynamic queries (see Fig. 4 
for a screenshot). In this system, sliders continuously set selection criteria for 
properties (shown as color-coded points on a map) and can be used to naturally 
and rapidly explore the data space. 

Although these preliminary results are qualitatively encouraging, more rig- 
orous testing remains to be done to quantify performance on real-world data 
in terms of recognition rates, etc. Further work is also necessary to determine 
how robust the method is under different conditions and whether it might get 
stuck in local optima (switching to simulated annealing may be useful) or have 
difficulty identifying records properly. 

5 Related Work 

Ahonen has used grammatical inference to generate structural descriptions for 
tagged SGML documents such as dictionaries and textbooks, while Freitag 
explored the use of grammatical inference to find field boundaries in free text. 
A large amount of work has been done on developing inference algorithms, and 
a useful overview is available in the literature. 

Work on integrating data from multiple websites has been carried out by 
a number of researchers. One of the first, Krulwich’s BargainFinder, was 
able to scan product listings and prices from a set of on-line web stores and 
extract them into a unified ordered table. However, it was based entirely on 
hand-coded wrappers tailored specifically to each source site. ShopBot went a 
step further by using various ad hoc heuristics to automate wrapper building for 
online stores, but was extremely domain-specific. Kushmerick extended this 
work by defining some classes of wrappers that could be induced from labelled 
examples, while Ashish and Knoblock built a toolkit for semi-automatically 
generating wrappers using hardcoded heuristics. 
Fig. 4. Screenshot of an interactive real estate visualization system 



Craven et al. describe an inductive logic programming algorithm for learning 
wrappers, also using labelled examples. Cohen introduced a method for 
learning a general extraction procedure from pairs of page-specific wrappers and 
the pages they wrap, although the method was restricted to simple list structures. 
AutoWrapper induces wrappers from unlabelled examples but is restricted 
to simple table structures. Ghani et al. combined extraction of data from corporate 
websites with data mining on the resulting information. The TSIMMIS 
project is another system aimed at integrating web data sources; however, its 
main focus is on query planning and reasoning about source capabilities rather 
than information extraction (performed by hand-coded wrappers). 



6 Conclusions and Future Work 

In conclusion, we have demonstrated a principled method for generating informa- 
tion extraction wrappers using grammatical inference that enables the integra- 
tion of information from multiple web sources. Our approach does not require the 
overhead of manually-labelled examples, should be applicable to general struc- 
tures, and ought to be easily adaptable to a variety of domains using domain 
knowledge expressed in simple declarative rules. 

These are still preliminary results, and further work is necessary to test the 
inference algorithm on more complicated web pages from real-world sources in 
different domains, and to conduct a more rigorous quantitative evaluation. We 
would also like to examine the use of simulated annealing in the search. 
Abstract. Biologists have determined that the control and regulation of 
gene expression is primarily determined by relatively short sequences in 
the region surrounding a gene. These sequences vary in length, position, 
redundancy, orientation, and bases. Finding these short sequences is a 
fundamental problem in molecular biology with important applications. 
Though there exist many different approaches to signal/motif (i.e. short 
sequence) finding, in 2000 Pevzner and Sze reported that most current 
motif finding algorithms are incapable of detecting the target signals in 
their so-called Challenge Problem. In this paper, we show that using an 
iterative-restart design, our new algorithm can correctly find the targets. 
Furthermore, taking into account the fact that some transcription factors 
form a dimer or even more complex structures, and transcription process 
can sometimes involve multiple factors, we extend the original problem 
to an even more challenging one. We address the issue of combinatorial 
signals with gaps of variable lengths. To demonstrate the efficacy of our 
algorithm, we tested it on a series of the original and the new challenge 
problems, and compared it with some representative motif-finding algo- 
rithms. In addition, to verify its feasibility in real-world applications, we 
also tested it on several regulatory families of yeast genes with known 
motifs. The purpose of this paper is two-fold. One is to introduce an 
improved biological data mining algorithm that is capable of dealing 
with more variable regulatory signals in DNA sequences. The other is to 
propose a new research direction for the general KDD community. 



1 Introduction 

Various genome projects have generated an explosive amount of biosequence 
data; however, our biological knowledge has not increased at 
the same pace as the biological data. This imbalance has stimulated 
the development of many new methods and devices to address issues such as 
annotation of new genes [1][2]. Once the Human Genome Project is completed, 
it can be expected that related experiments will be carried out soon. The tough 
computational challenges resulting from large-scale genomic experiments lie in 
the specificity and complexity of the biological processes, e.g., how to identify 
the genes directly involved in diseases, how these genes function, and how these 
genes are regulated. Answers to these questions are closely related 
to the future of health care and genomic medicine that will lead to personalized 
therapy. The success of future health care will affect the entire human 
race in terms of quality of life and even life span. Though the content of this 
paper is focused on one specific biological problem, another important objective 
of this paper is to draw the attention of the general KDD community to a new 
research area which needs considerable efforts and novel techniques from a wide 
variety of research fields, including KDD. 

A cluster of co-regulated genes isolated by gene expression measurements 
can only show which genes in a cell react similarly to a stimulus. What 
biologists further want to understand is the mechanism that is responsible for 
the coordinated responses. The cellular response to a stimulus is controlled by 
the action of transcription factors. A transcription factor, which itself is a special 
protein, recognizes a specific DNA sequence. It binds to this regulatory site to 
interact with RNA polymerase, and thus to activate or repress the expression 
of a selected set of target genes. Given a family of genes characterized by their 
common response to a perturbation, the problem we try to solve is to find these 
regulatory signals (aka motifs or patterns), i.e. transcription factor binding sites, 
that are shared by the control regions of these genes. 

It has been determined that the control and regulation of gene expression 
is primarily determined by relatively short sequences in the region surrounding 
a gene. These sequences vary in length, position, redundancy, orientation, and 
bases, and these characteristics make the problem computationally difficult. 
For example, a typical problem would be: given 30 DNA sequences, each 
of length 800, find a common pattern of length 8. Let us simplify the problem, 
as many algorithms do, and assume the pattern occurs exactly once in each 
sequence. This means that there are approximately $800^{30}$ potential combinations 
of locations for a motif candidate. Research on finding subtle regulatory signals has been 
around for many years, and still draws a lot of attention because it is one of the 
most fundamental and important steps in the study of genomics [3-9]. Although 
many algorithms already exist, this problem is nevertheless 
far from being resolved [10]. Pevzner and Sze found that several widely used 
motif-finding algorithms failed on the Challenge Problem, stated as follows. 

Let $S = \{s_1, \ldots, s_t\}$ be a sample of $t$ $n$-letter sequences. Each sequence 
contains an $(l, d)$-signal, i.e., a signal of length $l$ with $d$ mismatches. The problem is 
how to find the correct $(l, d)$-signal. 

In their experiments, they implanted a (15,4)-signal in a sample of 20 sequences. 
To verify the effect of the sequence length, they varied n from 100 to 1000. 
The experimental results showed that as the sequence length increased, the 
performance of MEME [3], CONSENSUS [4] and the Gibbs sampler [5] decreased 
dramatically. There are two causes for these failures. First, the algorithms may 
become trapped in local optima. Increasing the sequence length introduces more local 
optima and further aggravates the problem. Second, they rely on the hope that 
the instances of the target signal appearing in the sample will reveal the signal 
itself. However, in the Challenge Problem, there are no exact signal occurrences 
in the sample, only variant instances with 4 mismatches. Pevzner and Sze 
proposed WINNOWER and SP-STAR to solve the Challenge Problem, but the 
applicability of WINNOWER is limited by its complexity, and the performance 
of SP-STAR, like that of the others, drops significantly as the sequence length increases. 
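A Challenge Problem instance of the kind described above can be generated with a few lines of code, which is useful for testing a motif finder against a known answer. The sketch below makes assumptions that go beyond the text: the background is uniform random DNA, each implanted instance differs from the pattern in exactly d positions, and the default n = 600 is an arbitrary value within the 100 to 1000 range. Function and parameter names are hypothetical.

```python
import random

BASES = "ACGT"


def implant(t=20, n=600, l=15, d=4, seed=0):
    """Return (pattern, sequences, positions) for a planted (l, d)-signal."""
    rng = random.Random(seed)
    pattern = "".join(rng.choice(BASES) for _ in range(l))
    sequences, positions = [], []
    for _ in range(t):
        background = [rng.choice(BASES) for _ in range(n)]
        variant = list(pattern)
        for i in rng.sample(range(l), d):          # d distinct positions
            variant[i] = rng.choice([b for b in BASES if b != variant[i]])
        start = rng.randrange(n - l + 1)
        background[start:start + l] = variant      # overwrite with the variant
        sequences.append("".join(background))
        positions.append(start)
    return pattern, sequences, positions


def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


pattern, sequences, positions = implant()
distances = [hamming(pattern, s[p:p + len(pattern)])
             for s, p in zip(sequences, positions)]
print(set(distances))  # {4}
```

No sequence contains the pattern exactly at its planted position, which is the property that makes the problem hard for methods relying on exact occurrences.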

Because transcription factors may form a dimer or more complex 
structures, and some transcription initiations may require the binding of two or 
more transcription factors at the same time, we further extend the Challenge 
Problem by addressing the issue of combinatorial signals with gaps of variable 
lengths. Most current approaches can only find motifs consisting of contiguous 
bases without gaps. Some methods have been proposed to deal with motifs 
or alignments with gaps, but they either restrict their focus to fixed gaps [11-13] 
or use representations less expressive than the weight matrix, e.g., regular 
expression-like languages or the IUPAC code [14][15]. To alleviate the limitations 
of current approaches, we introduce a new algorithm called MERMAID, which 
adopts the matrix for motif representation, and is capable of dealing with gaps of 
variable lengths. This presentation expands upon work by others by combining 
multiple types of motif significance measures with an improved iterative sampling 
technique. We demonstrate its effectiveness in both the original and the 
extended Challenge Problems, and compare its performance with that of several 
other major motif finding algorithms. To verify its feasibility in real-world 
applications, we also tested MERMAID on many families of yeast genes that share 
known regulatory motifs. 



2 Background 

There are three main interrelated computational issues: the representation of a
pattern, the definition of the objective function, and the search strategy. While
we examine the algorithms on computational grounds, the final gold standard
is how well an algorithm predicts motifs.



2.1 Representation 

As the primary DNA sequences are described by a double-stranded string of
nucleic bases {A,C,G,T}, the most basic pattern representation is the exact
base string. Due to the complexity and flexibility of the motif binding
mechanism, there is rarely any motif that can be exactly described by a string
of nucleic bases. To obtain more flexibility, the IUPAC code was designed, which
extends the expressiveness of the simple base string representation by
including all disjunctions of nucleotides. In this language there is a new
symbol for each possible disjunction, e.g. W represents A or T.
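As an illustration of the disjunction idea, the standard IUPAC nucleotide symbols can be stored as a lookup table and used to match a pattern against a site. The helper names `matches` and `find_sites` are hypothetical; only the symbol table itself is standard.

```python
# Standard IUPAC nucleotide codes: each symbol stands for a set of bases.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}

def matches(pattern, site):
    """True if every base of `site` is allowed by the IUPAC symbol at that position."""
    return len(pattern) == len(site) and all(
        base in IUPAC[symbol] for symbol, base in zip(pattern, site)
    )

def find_sites(pattern, sequence):
    """Start positions of all windows of `sequence` that match `pattern`."""
    w = len(pattern)
    return [i for i in range(len(sequence) - w + 1)
            if matches(pattern, sequence[i:i + w])]
```

For example, the pattern GATWAG accepts both GATAAG and GATTAG but rejects GATCAG.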

A more informative pattern representation is a probability matrix in which
each element reflects the importance of the base at a particular position. Such
matrices can be easily translated into the IUPAC code, while the converse is
not true. These matrices are often derived from the observed occurrence
frequencies. For example, in the NIT regulatory family [6], which contains 7
members, a possible 6-base motif matrix is illustrated in Fig. 1. The normalized
matrix is also shown in this figure.



A   0  7  0  7  7  0                    A   0.00  1.00  0.00  1.00  1.00  0.00
G   6  0  0  0  0  7   normalized to    G   0.86  0.00  0.00  0.00  0.00  1.00
C   1  0  0  0  0  0                    C   0.14  0.00  0.00  0.00  0.00  0.00
T   0  0  7  0  0  0                    T   0.00  0.00  1.00  0.00  0.00  0.00



Fig. 1. A 6-base Motif Matrix Example 
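The normalization in Fig. 1, and the one-way translation from a matrix to the IUPAC code, can be sketched as follows. The counts are those of Fig. 1; the function names and the 0.1 cut-off used to decide which bases count as present at a position are assumptions made for illustration only.

```python
# Occurrence counts from Fig. 1 (7 family members, 6 positions).
COUNTS = {
    "A": [0, 7, 0, 7, 7, 0],
    "G": [6, 0, 0, 0, 0, 7],
    "C": [1, 0, 0, 0, 0, 0],
    "T": [0, 0, 7, 0, 0, 0],
}

# Set of bases -> standard IUPAC symbol.
IUPAC_SYMBOL = {
    frozenset("A"): "A", frozenset("C"): "C", frozenset("G"): "G", frozenset("T"): "T",
    frozenset("AG"): "R", frozenset("CT"): "Y", frozenset("CG"): "S",
    frozenset("AT"): "W", frozenset("GT"): "K", frozenset("AC"): "M",
    frozenset("CGT"): "B", frozenset("AGT"): "D", frozenset("ACT"): "H",
    frozenset("ACG"): "V", frozenset("ACGT"): "N",
}

def normalize(counts):
    """Divide each count by its column total to obtain per-position probabilities."""
    width = len(counts["A"])
    totals = [sum(row[i] for row in counts.values()) for i in range(width)]
    return {base: [row[i] / totals[i] for i in range(width)]
            for base, row in counts.items()}

def to_iupac(matrix, threshold=0.1):
    """Keep bases with probability >= threshold (assumed cut-off) and emit their symbol."""
    width = len(matrix["A"])
    symbols = []
    for i in range(width):
        present = frozenset(b for b in matrix if matrix[b][i] >= threshold)
        symbols.append(IUPAC_SYMBOL[present])
    return "".join(symbols)

matrix = normalize(COUNTS)
```

With these counts the first column becomes G = 0.86 and C = 0.14, and the IUPAC string is SATAAG. The string no longer records that G is six times as frequent as C, which is why the converse translation is not possible.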



2.2 Objective Function 

The purpose of an objective function is to approximate the biological meaning
of the patterns in terms of a mathematical function. Objective functions are
heuristics. Once the objective function is determined, the goal is to find those
patterns with high objective function values. Different objective functions have
been derived from background knowledge, such as the secondary structures
of homologous proteins, the relation between the energetic interactions among
residues and the residue frequencies, etc. [17] [18]. Objective functions based
on the information content or its variants have been proposed [4] [5]. Others
evaluate the quality of the pattern by its likelihood or by some other measure
of statistical significance [3] [13].

Even though many different objective functions are currently in use, it is
still unclear which objective function or which pattern representation is most
appropriate for capturing biologically significant motifs. More likely,
additional knowledge will need to be incorporated to improve motif
characterization. In the final analysis, the various algorithms can only produce
candidate motifs that will require biological experiments to verify.
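One common information-content style objective is the relative entropy of a probability matrix against a background distribution. The sketch below assumes a uniform background by default and treats 0 log 0 as 0; it is meant as a generic illustration, not as the exact formula used in [4] or [5].

```python
import math

def relative_entropy(matrix, background=None):
    """Sum over positions and bases of p * log2(p / q), with 0 * log 0 taken as 0.

    `matrix` maps each base to its list of per-position probabilities.
    `background` maps each base to its background probability q
    (assumed uniform when omitted).
    """
    bases = list(matrix)
    if background is None:
        background = {b: 1.0 / len(bases) for b in bases}
    width = len(matrix[bases[0]])
    total = 0.0
    for i in range(width):
        for b in bases:
            p = matrix[b][i]
            if p > 0:
                total += p * math.log2(p / background[b])
    return total

conserved = {"A": [1.0], "C": [0.0], "G": [0.0], "T": [0.0]}
uniform = {"A": [0.25], "C": [0.25], "G": [0.25], "T": [0.25]}
```

A fully conserved position scores 2 bits against a uniform background, and a position indistinguishable from the background scores 0, so higher totals correspond to better conserved patterns.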



2.3 Search Strategy 

If one adopts the exact string representation, then one can exhaustively check
every possible candidate. However, this approach is only able to identify short
known motifs or partial long motifs [13]. Therefore, the primary representation
used is a probability matrix [3] [4] [5] [7]. Once one accepts a probability
matrix as the representation, an exhaustive search is no longer possible.
Initial approaches started with hill-climbing strategies, but these typically
fell into local optima. Standard approaches to repairing hill-climbing, such as
beam and stochastic search, were tried next [4]. The current approaches involve
a mixture of sampling and stochastic iterative improvement. This avoids the
computational explosion and maintains or improves the ability to find motifs
[3] [5] [7].



3 MERMAID: Matrix-Based Enumeration and Ranking
of Motifs with gAps by an Iterative-Restart Design

According to the objective function they apply, most current approaches based
on greedy or stochastic hill-climbing algorithms optimize the probability matrix
over all positions within a sequence [4] [5]. This is not only inefficient, but
may also increase the chance of getting trapped in local optima when subtle
signals are contained in long sequences, because a greater number of similar
random patterns coexist in the sequences. To avoid this drawback, we can begin
by allowing each substring of length l to be a candidate signal. We then convert
each such substring into a probability matrix, adopting an idea from [3]. This
gives us a set of seed probability matrices to be used as starting points for
iterative improvement. We use the seed probability matrix as a reference to
locate the potential signal positions, i.e., those with match scores above some
threshold. The optimization procedure only checks these potential positions
instead of all possible locations in a sequence. By directing attention to the
patterns that are the same as or close to the substring considered as a motif
candidate, we can significantly constrain the search space during the iterative
improvement process.
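The seeding and filtering idea can be sketched for a single, ungapped component. Several choices here are assumptions rather than details given in the text: the match score is taken to be a sum of log-probabilities, the threshold is the mean of the best score per sequence (as in steps (6)-(8) of Fig. 2), and a sequence with no position above the threshold falls back to its single best position.

```python
import math

BASES = "ACGT"

def seed_matrix(substring, x=0.5):
    """One column per position: probability x for the seed base, (1 - x)/3 for the others."""
    other = (1.0 - x) / (len(BASES) - 1)
    return [{b: (x if b == c else other) for b in BASES} for c in substring]

def match_score(matrix, site):
    """Assumed score: log-probability of `site` under the matrix."""
    return sum(math.log(column[base]) for column, base in zip(matrix, site))

def potential_positions(matrix, sample):
    """Positions in each sequence whose score reaches the mean of the per-sequence best scores."""
    w = len(matrix)
    scores = [[match_score(matrix, s[i:i + w]) for i in range(len(s) - w + 1)]
              for s in sample]
    threshold = sum(max(row) for row in scores) / len(scores)
    result = []
    for row in scores:
        hits = [i for i, sc in enumerate(row) if sc >= threshold]
        # Fallback (an assumption): keep the best position if nothing passes.
        result.append(hits or [row.index(max(row))])
    return result

seed = seed_matrix("ACG")
candidates = potential_positions(seed, ["TTACGTT", "TTACCTT"])
```

Only the windows closest to the seed survive, so the later optimization looks at a handful of positions per sequence instead of every offset.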

Nevertheless, when the target signal is very subtle, e.g., a (15,4)-signal,
considering only the selected potential signal positions introduces a bias.
This bias stems from the assumption that the instances of the target signal
existing in the sample have sufficient regularity so that we can finally derive
the correct target signal from these instances through optimization.
Unfortunately, this optimistic assumption does not hold if the regularity
represented by the signal instances is inadequate to distinguish them from
similar random patterns. As a consequence, the chance of mistaking random
patterns for real signal instances gets higher. The optimization process may
thus converge to variant patterns other than the correct signal.

When dealing with subtle signals, a stochastic approach is not guaranteed to 
find the correct target signal owing to the influence of similar random patterns. 
However, the pattern it converges to must be close to the target itself because the 
random patterns must carry some resemblance to the target signal; otherwise, 
they would not be selected to participate in the optimization process. Suppose 
the target signal is the most conserved pattern in the sample as usually expected 
and we use one signal instance as the seed for optimization. No matter what 
pattern it finally converges to, this pattern is at least closer to the target signal 
than the substring (i.e. the signal instance in the sample) used as the seed even 
if it is not the same as the target. Since the converged pattern is closer to the 
target signal, one way to further refine this pattern is to reuse it as a seed, and 
run through the optimization again. We can iteratively restart the optimization 
procedure with the refined pattern as a new seed until no improvement is shown. 
With this iterative restart strategy, we expect to successfully detect subtle signals
like (l, d)-signals in the Challenge Problem.
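The restart loop described above can be written as a small higher-order function. This is a schematic sketch only: `optimize` and `significance` stand in for MERMAID's stochastic optimization and its significance measure, which are not reproduced here, and the toy functions at the end exist purely to exercise the control flow.

```python
def iterative_restart(seed, optimize, significance, cycle_limit=10):
    """Re-run `optimize` from its own result until significance stops improving.

    `optimize` maps a seed pattern to a refined pattern; `significance`
    scores a pattern (higher is better). Both are supplied by the caller.
    """
    best, best_score = seed, significance(seed)
    for _ in range(cycle_limit):
        candidate = optimize(best)
        score = significance(candidate)
        if score <= best_score:  # no improvement: stop restarting
            break
        best, best_score = candidate, score
    return best

# Toy stand-ins: each "optimization" moves one step towards the target value 5.
toy_optimize = lambda x: x + 1 if x < 5 else x
toy_significance = lambda x: -abs(5 - x)
result = iterative_restart(0, toy_optimize, toy_significance)
```

The cycle limit bounds the number of restarts, mirroring the cycle limit L in Fig. 2.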

Pevzner and Sze proposed an extension to SP-STAR to deal with gapped
signals [10], but their method typically addressed the fixed-gap case only.
However, in some real domains, motifs may contain gaps of variable lengths, and
simultaneous and proximal binding of two or more transcription factors may
be required to initiate transcription [9] [14]. Therefore, a natural extension
to the Challenge Problem is to find combinatorial (l, d)-signals. A
combinatorial (l, d)-signal may consist of multiple (l, d)-signals as its
components, and the length of the gap between two components may vary within a
given range. For example, an (l, d)-X(m, n)-(l, d)-signal is one that has two
(l, d)-signals separated by a gap of variable length between m and n bases. Note
that the signal length and the number of mutations may differ among components.
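To make the notion of a two-component signal with a variable gap concrete, the following sketch enumerates every pair of substrings in one sequence whose gap lies between m and n bases. The function name and the inclusive treatment of both gap bounds are assumptions.

```python
def substring_combos(sequence, l1, l2, m, n):
    """All (start1, start2, component1, component2) with a gap of m..n bases in between."""
    combos = []
    for start in range(len(sequence)):
        for gap in range(m, n + 1):
            second = start + l1 + gap
            if second + l2 > len(sequence):
                break  # larger gaps would run past the end as well
            combos.append((start, second,
                           sequence[start:start + l1],
                           sequence[second:second + l2]))
    return combos

combos = substring_combos("ACGTACGTACGTACGTACGT", 6, 6, 1, 3)
```

Each additional gap multiplies the number of combinations per start position by the size of the gap range, which is where the dependence of the running time on the gap range comes from.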

There are generally two approaches to finding combinatorial signals. The first
is a two-phase approach. We find signal component candidates in the first phase.
In the second phase, we use the component candidates to form and verify signal
combinations [16]. This approach is effective when the signal components are
significant enough per se to be identified in the first phase for the later
combination check. In cases where the signal components gain significance only
in combination, the two-phase approach may overlook the interaction between
components and thus fail to find meaningful combinations. To avoid this
limitation, an alternative approach is to find combinatorial signals directly.
We developed MERMAID (Matrix-based Enumeration and Ranking of Motifs with gAps
by an Iterative-restart Design) to deal with subtle combinatorial signals.

The main process flow of MERMAID is divided into four steps. First, given a
biosequence family, it translates substring combinations into matrices. Each
substring is converted into a probability matrix in two steps, adopting
an idea from [3]: we fix the probability of every base in the substring to
some value 0 < X < 1, and assign the probabilities of the other bases according
to (1 - X)/(4 - 1) (4 nucleic bases). Following Bailey and Elkan, we set X to
0.5. We also tried setting X to 0.6; the results showed no significant
difference. Each matrix represents a component of a combinatorial motif. This
step gives us a set of seed probability matrices to be used as starting points
for iterative improvement. Second, it filters the potential motif positions in
the family of sequences. Note that each single motif is derived from a substring
combination. Thus, besides the matrices, MERMAID also keeps the locations of
substrings for all potential motifs to deal with the flexible gaps. Third, given
the set of potential motif positions that include the location of each motif
component (i.e. substring), it performs an iterative stochastic optimization
procedure to find motif candidates. Finally, it ranks and reports these
candidates according to a motif significance that combines different types of
quality measures, including consensus [4], multiplicity [6] [13] and coverage
[7]. The consensus quality is derived from the relative entropy, which is used
to measure how well a motif is conserved. The multiplicity is defined as the
ratio of the number of motif occurrences in the family to that in the whole
genome. This measures the representativeness of a motif in a family relative to
the entire genome and, consequently, discounts motifs which are common
everywhere, such as tandem repeats or poly A’s. We define motif coverage as the
ratio of the number of sequences containing the motif to the total number of
biosequences in the family. This reflects the importance of a motif being
commonly shared by functionally related family members. Due to limited space,
please refer to [16] for more details.
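The multiplicity and coverage measures follow directly from their definitions. The sketch below simplifies matters by counting exact string occurrences (MERMAID itself scores matrix matches), and it assumes that the genome collection passed in includes the family sequences; the toy sequences are invented for illustration.

```python
def count_occurrences(motif, sequence):
    """Number of (possibly overlapping) exact occurrences of `motif` in `sequence`."""
    w = len(motif)
    return sum(sequence[i:i + w] == motif for i in range(len(sequence) - w + 1))

def coverage(motif, family):
    """Fraction of family sequences that contain the motif at least once."""
    return sum(count_occurrences(motif, s) > 0 for s in family) / len(family)

def multiplicity(motif, family, genome):
    """Occurrences in the family divided by occurrences in the whole genome."""
    in_family = sum(count_occurrences(motif, s) for s in family)
    in_genome = sum(count_occurrences(motif, s) for s in genome)
    return in_family / in_genome if in_genome else 0.0

# Invented toy data: three family sequences inside a five-sequence "genome".
family = ["GATAAGGATAAG", "CCGATAAG", "CCCCCC"]
genome = family + ["TTGATAAGTT", "AAAA"]
```

Here GATAAG covers two of the three family sequences, and three of its four genome-wide occurrences fall inside the family.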

A pseudocode description of the iterative-restart optimization procedure in
MERMAID is given in Fig. 2. Let n be the sequence length. Pseudocode lines
(4)-(9) scan the entire sample against each matrix m to find the highest match
scoring substring combination in each sequence, locate the potential positions
of the combinatorial motif, and form an initial matrix combination M. These
take \(O(n \cdot G^{N-1} \cdot |S|)\) operations in total, where G is the maximum
gap range and N is the total number of motif components. Let p be the maximum
number of potential positions in a sequence; typically \(p \ll n\). The inner
repeat-loop (10)-(14) takes \(O(p \cdot L)\) operations to check different
positions, where L is a constant for the cycle limit. Pseudocode lines (15)-(19),
which scan the entire sample against matrix M to isolate signal repeats and form
the final probability matrix FM, also take \(O(n \cdot G^{N-1} \cdot |S|)\)
operations. From the above, the outer repeat-loop (3)-(21) takes
\(O(L(2n \cdot G^{N-1} \cdot |S| + pL)) = O(n \cdot G^{N-1} \cdot |S|)\) in total.
Now considering the outer for-loop (1)-(21) and (22)-(23), we conclude that the
whole procedure is bounded by
\(O(n \cdot G^{N-1} \cdot |S| \cdot n \cdot G^{N-1} \cdot |S|) = O((n \cdot G^{N-1} \cdot |S|)^2)\).
When G and N are relatively small,
\(O((n \cdot G^{N-1} \cdot |S|)^2) = O((n \cdot |S|)^2)\), which is the same as
MEME and SP-STAR, but lower than WINNOWER’s \(O((n \cdot |S|)^{k+1})\), where k
is the clique size, k > 2 in general.



4 Experimental Results 

One of the goals of this paper is to demonstrate that our new motif detection
algorithm, enhanced by an iterative restart strategy, is able to find subtle
signals, e.g. a (15,4)-signal. Based on its definition, we reproduced the
Challenge Problem, and used it to compare our new algorithm with others.

Pevzner and Sze’s study [10] showed that for a (15,4)-signal, CONSENSUS,
the Gibbs sampler and MEME start to break at sequence lengths of 300-400bp.
Their system called SP-STAR breaks at lengths of 800 to 900bp, and their other
algorithm, named WINNOWER, performs well through the whole range of lengths
up to 1000bp. Using the same data generator to create data samples (thanks to
Sze for providing the program), we demonstrate that our new algorithm is
competitive with other systems. We tested MERMAID on eight samples, as Pevzner
and Sze did, each containing 20 i.i.d. sequences of length 1000bp. The
comparison of the performance of the various algorithms is shown in Table 1. The
numbers in Table 1 are the performance coefficients as defined in [10], averaged
over the eight samples. Let K be the set of known signal positions in a sample,
and let P be the set of predicted positions. The performance coefficient is
defined as \(|K \cap P| / |K \cup P|\).
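The performance coefficient is the ratio of the intersection to the union of the known and predicted position sets. In the sketch below, positions are represented as (sequence index, offset) pairs, and an empty union is scored as 0; both choices are assumptions made for the example.

```python
def performance_coefficient(known, predicted):
    """|K intersect P| / |K union P| for two collections of signal positions."""
    known, predicted = set(known), set(predicted)
    union = known | predicted
    return len(known & predicted) / len(union) if union else 0.0

def site_positions(seq_index, start, length):
    """All (sequence, offset) pairs covered by a site of the given length."""
    return {(seq_index, start + i) for i in range(length)}

# One sequence: true 15-base site at offset 10, prediction shifted by 3 bases.
K = site_positions(0, 10, 15)
P = site_positions(0, 13, 15)
coefficient = performance_coefficient(K, P)
```

A prediction shifted by 3 bases overlaps the true site in 12 positions out of a union of 18, giving a coefficient of 2/3; a perfect prediction gives 1.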

Moreover, in order to show that it is the synergy of the iterative restart
strategy and the optimization procedure combined with the multiple objective
functions in MERMAID that helps find the subtle signals, we implanted in the
sample, at a random position, the motif found by MEME with minimum mismatches
to the target signal. We then reran MEME. We repeated the above



Given: a set of biosequences, S

       the total width of a combinatorial motif, W (excluding gaps)
       the maximal gap range, G
       the number of components in a combinatorial motif, N
       the cycle limit, L
Return: a set of ranked motif candidates, C

(1)  For each substring combo s in S Do
(2)    Set s to ss as a seed
(3)    Repeat
(4)      Translate each substring in ss into candidate probability
         matrix m via:
           m(i,base) = .50 if base occurs in position i
                     = .17 otherwise
(5)      Find highest match scoring substring combo in each sequence in S
(6)      Compute the mean of the highest match scores in S
(7)      For each sequence in S Do
(8)        Set Potential Positions to those with match score >= mean
(9)      Randomly choose a Potential Position in each sequence
         to initialize matrix combo M
(10)     Repeat
(11)       Randomly pick a sequence s in S
(12)       Check if M’s significance can be improved by using a
           different Potential Position in s
(13)       Update matrix combo M
(14)     Until (no improvement in M’s consensus) or (reach the
         cycle limit L)
(15)     Compute the mean of match scores of substring combo
         contributing to M
(16)     For each sequence s in S Do
(17)       Isolate motif repeats to those with match score >= mean
(18)     Form the final matrix combo FM with all repeats in S
(19)     Convert matrix combo FM into string combo ss as a new seed
(20)   Until (no improvement in FM’s significance) or (reach the
       cycle limit L)
(21)   Put FM in C
(22) Sort all motif candidates in C according to significance
(23) Return C



Fig. 2. Pseudocode of MERMAID



Table 1. Comparison of performance for (15,4)-signals in 20 i.i.d. sequences of length
1000bp

Algorithm                        Performance coefficient
CONSENSUS                        0.06
Gibbs                            0.11
MEME                             0.02
MEME (w/ iterative restart)      0.09
oligonucleotide analysis         0.00
WINNOWER (clique size is 3)      0.88
SP-STAR                          0.23
MERMAID                          0.75



Table 2. Performance of MERMAID for (6,1)-X(m,n)-(6,1)-signal in 20 i.i.d. sequences
of length 1000bp

g = 3    g = 5    g = 7    g = 9
0.91     0.88     0.90     0.56



process, and checked whether this iterative restart strategy alone could improve 
MEME’s performance. The reason we tested MEME is that MERMAID adopts 
the same motif enumeration method as MEME. Since MEME exhaustively tests 
every substring in the sample, the implanted substring will be used as a seed in 
the next run. We only implanted the motif closest to the real signal (i.e., mini- 
mum mismatches) to ensure that the base distribution in the sample was nearly 
unchanged. Though we did not actually re-code MEME, this approximate sim- 
ulation could still effectively reflect its performance. The result is also presented 
in Table 1. 

Table 1 indicates that MERMAID outperforms CONSENSUS, the Gibbs
sampler and MEME (with or without iterative restart) by a significant margin.
Note that the performance coefficients of WINNOWER and SP-STAR reported by
Pevzner and Sze (2000) [10] are included only for reference because we did not
have access to these two systems at the time. However, this indirect evidence
may suggest that MERMAID performs better than SP-STAR, and is expected to be
comparable with WINNOWER. We also tested MERMAID on ten real regulons
collected by van Helden et al. [6] to verify its usefulness in finding motifs in
real-world domains. MERMAID successfully identified all the known motifs in
each regulon.

For the extended Challenge Problem, we tested MERMAID on (6,1)-X(m,n)-(6,1)-signals
in a set of 20 sequences of length 1000bp, where m and n were
varied to form a gap range from three to nine bases. The experimental results
are presented in Table 2, in which g denotes the gap range. They show that the
performance of MERMAID is quite stable until the gap range reaches nine.

In addition to the artificial problem, we also tested MERMAID on several real
regulons [13] in which the known binding sites have fixed gaps. The summary
of the regulons is presented in Table 3, and we show the results in Table 4.
In the fourth column of Table 4, the number within the brackets is the
rank of the signal found by MERMAID. Converting the matrices found into
IUPAC codes, we compared them with the published motifs, and found that they
have significant similarity. The known motifs in the regulatory families are all
ranked in the top ten. The experimental results indicate that MERMAID, which
was originally developed to deal with variable gaps, also performs well on real
domains where motifs have fixed gaps.



Table 3. Summary of regulons used in the experiments

Family   Genes
GAL4     GAL1 GAL2 GAL7 GAL80 MEL1 GCY1
CAT8     ACR1 ICL1 MLS1 PCK1 FBP1
HAP1     CYB2 CYC1 CYC7 CTT1 CYT1 ERG11 HEM13 HMG1 ROX1
LEU3     GDH1 ILV1 LEU1 LEU2 LEU4
LYS      LYS1 LYS2 LYS4 LYS9 LYS20 LYS21
PPR1     URA1 URA3 URA4
PUT3     PUT1 PUT2


5 Conclusion and Future Work 

In this paper we have described a new subtle signal detection algorithm called
MERMAID, which iteratively restarts a multi-strategy optimization procedure
combined with complementary objective functions to find motifs. The
experimental results show that the system performs significantly better than
most current algorithms on the Challenge Problem. To support the claim that the
success of MERMAID is attributable to the synergy of iterative restart and the
other components of the system, i.e. the optimization procedures and objective
functions, we have demonstrated that simply attaching an iterative restart
strategy to MEME yields little improvement.

The difficulty of finding the biologically meaningful motifs results from the 
variability in (1) the bases at each position in the motif, (2) the location of 
the motif in the sequence and (3) the multiplicity of motif occurrences within 
a given sequence. In addition, the short length of many biologically significant 
motifs and the fact that motifs gain biological significance only in combinations 
make them difficult to determine. MERMAID was developed to deal with subtle 
combinatorial signals. Our experiments showed MERMAID successfully detected 
combinatorial signals composed of proximal components as well as the known 
motifs with gaps in many real regulons of yeast genes. 

For future work, we aim to improve MERMAID in two directions: efficiency
and flexibility. First, the optimization processes in MERMAID for different
candidates are independent of one another. Therefore, MERMAID
can be easily implemented on a parallel or distributed system to improve its
efficiency. Second, MERMAID only performs well on combinatorial signals with
gaps within a relatively tight range. A wider range of gap lengths produces a
larger search space for motif-finding algorithms, and in such cases it is
computationally prohibitive to enumerate all possibilities exhaustively. We thus
plan to apply a second stochastic sampling process to search through the space
of variable gaps, and to incorporate domain knowledge when available to
constrain the search space.



Abstract. Fuzzy association rules provide a data mining tool which is
especially interesting from a knowledge-representational point of view
since fuzzy attribute values allow for expressing rules in terms of natural
language. In this paper, we show that fuzzy associations can be
interpreted in different ways and that the interpretation has a strong
influence on their assessment and, hence, on the process of rule mining.
We motivate the use of multiple-valued implication operators in order to
model fuzzy association rules and propose quality measures suitable for
this type of rule. Moreover, we introduce a semantic model of fuzzy
association rules which suggests considering them as a convex combination
of simple association rules. This model provides a sound theoretical basis
and gives an explicit meaning to fuzzy associations. Particularly, the
aforementioned quality measures can be justified within this framework.



1 Introduction 

Association rules, syntactically written A ⇒ B, provide a means for representing
dependencies between attributes in databases. Typically, A and B denote sets
of binary attributes, also called features or items. The intended meaning of a
(binary) rule A ⇒ B is that a transaction (a data record stored in the database)
that contains the set of items A is likely to contain the items B as well. Several
efficient algorithms for mining association rules in large databases have been
devised [1,19,21]. Typically, such algorithms proceed by generating a set of
candidate rules from selected itemsets, which are then filtered according to
several quality criteria.

Generally, a database does not only contain binary attributes but also
attributes with values ranging over (completely) ordered scales, e.g. cardinal or
ordinal attributes. This has motivated a corresponding generalization of (binary)
association rules. Typically, a quantitative association rule specifies attribute
values by means of intervals, as e.g. in the simple rule “Employees at the age of
30 to 40 have incomes between $50,000 and $70,000.”

This paper investigates fuzzy association rules, which are basically obtained
by replacing intervals in quantitative rules by fuzzy sets (intervals). The use of
fuzzy sets in connection with association rules - as with data mining in general
[20] - has recently been motivated by several authors (e.g. [2, 3, 5-8, 13, 15, 23,
25]). Among other aspects, fuzzy sets avoid an arbitrary determination of crisp
boundaries for intervals. Furthermore, fuzzy associations are very interesting
from a knowledge representational point of view: The very idea of fuzzy sets is
to act as an interface between a numeric scale and a symbolic scale which is
usually composed of linguistic terms. Thus, the rules discovered in a database
might be presented in a linguistic and hence comprehensible and user-friendly
way. Example: “Middle-aged employees have considerable incomes.”

Even though fuzzy association rules have already been considered by some 
authors, the investigation of their semantics in the context of data mining has 
not received much attention as yet. This is somewhat surprising since a clear 
semantics is a necessary prerequisite, not only for the interpretation, but also 
for the rating and, hence, for the mining of fuzzy association rules. 

The semantics of fuzzy associations and their assessment by means of
adequate quality measures constitute the main topics of the paper. By way of
background, Section 2 reviews the aforementioned types of association rules. In
Section 3, we discuss quality measures for fuzzy associations. In this connection,
it is shown that a generalization of (quantitative) association rules can proceed 
from different perspectives, which in turn suggest different types of measures. 
We especially motivate the use of multiple-valued implication operators in order 
to model fuzzy association rules. In Section 4, we introduce a semantic model 
of fuzzy associations which considers them as convex combinations of simple 
association rules. This model clarifies the meaning and provides a sound theo- 
retical basis of fuzzy association rules. Particularly, the aforementioned quality 
measures can be justified within this framework. 



2 Association Rules 



2.1 Binary Association Rules 



Consider an association rule of the form A ⇒ B, where A and B denote subsets
of an underlying set \(\mathcal{A}\) of items (which can be considered as binary
attributes). As already said above, the intended meaning of A ⇒ B is that a
transaction \(T \subseteq \mathcal{A}\) which contains the items in A is likely to
contain the items in B as well.

In order to find “interesting” associations in a database D, a potential rule
A ⇒ B is generally rated according to several criteria, none of which should
fall below a certain (user-defined) threshold. In common use are the following
measures (\(D_X = \{T \in D \mid X \subseteq T\}\) denotes the transactions in the
database D which contain the items \(X \subseteq \mathcal{A}\), and \(|D_X|\) is its
cardinality):

- A measure of support defines the absolute number or the proportion of
  transactions in D containing A ∪ B:

  \[ \mathrm{supp}(A \Rightarrow B) = |D_{A \cup B}| \quad\text{or}\quad \mathrm{supp}(A \Rightarrow B) = \frac{|D_{A \cup B}|}{|D|}. \tag{1} \]

- The confidence is the proportion of correct applications of the rule:

  \[ \mathrm{conf}(A \Rightarrow B) = \frac{|D_{A \cup B}|}{|D_A|}. \tag{2} \]

- A rule A ⇒ B should be interesting in the sense that it provides new
  information. That is, the occurrence of A should indeed have a positive
  influence on the occurrence of B. A common measure of the interest of a rule is

  \[ \mathrm{int}(A \Rightarrow B) = \frac{|D_{A \cup B}|}{|D_A|} - \frac{|D_B|}{|D|}. \tag{3} \]

  This measure can be seen as an estimation of Pr(B | A) - Pr(B), that is, the
  increase in probability of B caused by the occurrence of A.
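The three measures can be computed directly from their definitions. The sketch below represents transactions as Python sets; the function names and the five-transaction toy database are invented for illustration, and a rule whose antecedent never occurs is given confidence 0 by assumption.

```python
def covering(database, items):
    """Transactions that contain every item in `items` (the set D_X)."""
    return [t for t in database if items <= t]

def supp(database, a, b, relative=True):
    """Support of A => B: count (or proportion) of transactions containing A and B."""
    count = len(covering(database, a | b))
    return count / len(database) if relative else count

def conf(database, a, b):
    """Confidence of A => B: |D_{A u B}| / |D_A|; 0.0 if A never occurs (assumption)."""
    d_a = len(covering(database, a))
    return len(covering(database, a | b)) / d_a if d_a else 0.0

def interest(database, a, b):
    """Estimated Pr(B | A) - Pr(B)."""
    return conf(database, a, b) - len(covering(database, b)) / len(database)

# Invented toy database of five transactions.
D = [
    {"bread", "butter"},
    {"bread", "butter", "milk"},
    {"bread"},
    {"milk"},
    {"butter", "milk"},
]
A, B = {"bread"}, {"butter"}
```

For the rule bread ⇒ butter this gives a support of 2 (or 0.4), a confidence of 2/3, and an interest of 2/3 - 3/5, i.e. buying bread raises the estimated probability of buying butter only slightly.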



2.2 Quantitative Association Rules 

In the above setting, a transaction T can be seen as a sequence
\((x_1, \ldots, x_m)\) of values of binary variables \(X_i\) with domain
\(\mathfrak{D}_{X_i} = \{0, 1\}\), where \(x_i = T[X_i] = 1\) if the i-th item is
contained in T and \(x_i = 0\) otherwise. Now, let X and Y be quantitative
attributes (such as age or income) with completely ordered domains
\(\mathfrak{D}_X\) and \(\mathfrak{D}_Y\), respectively. Without loss of generality
we can assume that \(\mathfrak{D}_X, \mathfrak{D}_Y \subseteq \mathbb{R}\). A
quantitative association rule involving the variables X and Y is then of the
following form:

\[ A \Rightarrow B: \text{ If } X \in A = [x_1, x_2] \text{ then } Y \in B = [y_1, y_2], \tag{4} \]

where \(x_1, x_2 \in \mathfrak{D}_X\) and \(y_1, y_2 \in \mathfrak{D}_Y\). This approach
can simply be generalized to the case where X and Y are multi-dimensional
vectors and, hence, A and B hyper-rectangles rather than intervals.
Subsequently, we proceed from fixed variables X and Y, and consider the database
D as a collection of data points \((x, y) = (T[X], T[Y])\), i.e. as a projection
of the original database.

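Note that the quality measures from Section 2.1 are applicable in the
quantitative case as well:¹

\[ \mathrm{supp}(A \Rightarrow B) = \bigl|\{(x, y) \in D \mid x \in A \wedge y \in B\}\bigr|, \tag{5} \]

\[ \mathrm{conf}(A \Rightarrow B) = \frac{\bigl|\{(x, y) \in D \mid x \in A \wedge y \in B\}\bigr|}{\bigl|\{(x, y) \in D \mid x \in A\}\bigr|}. \tag{6} \]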



In fact, each interval \([x_1, x_2]\) does again define a binary attribute
\(X_{x_1,x_2} = \mathbb{I}_{[x_1,x_2]}\). Thus, not only the rating but also the
mining of quantitative rules can be reduced to the mining of binary association
rules, by simply transforming the numerical data into binary data [18,22].
Still, finding a useful transformation (binarization) of the data is a
non-trivial problem by itself which affects both the efficiency of subsequently
applied mining algorithms and the potential quality of discovered rules. Apart
from data transformation methods, clustering techniques can be applied which
create intervals and rules at the same time [16, 24].
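The reduction to the binary case can be sketched by turning each interval into an indicator and counting. The data points below are invented (age, income) pairs chosen to match the example rule from the introduction; the function names are hypothetical.

```python
def indicator(interval):
    """Binary attribute defined by a closed interval [lo, hi]."""
    lo, hi = interval
    return lambda value: lo <= value <= hi

def supp_quant(data, a, b):
    """Number of points (x, y) with x in A and y in B."""
    in_a, in_b = indicator(a), indicator(b)
    return sum(in_a(x) and in_b(y) for x, y in data)

def conf_quant(data, a, b):
    """Share of the points with x in A that also have y in B; 0.0 if none has x in A."""
    in_a = indicator(a)
    n_a = sum(in_a(x) for x, _ in data)
    return supp_quant(data, a, b) / n_a if n_a else 0.0

# Invented (age, income) observations.
data = [(32, 55000), (38, 72000), (35, 60000), (45, 65000), (29, 40000)]
AGE, INCOME = (30, 40), (50000, 70000)
```

Three employees fall into the age interval and two of them into the income interval, so the support is 2 and the confidence 2/3. Shifting an interval boundary slightly can change both numbers abruptly, which is the arbitrariness of crisp boundaries mentioned in the introduction.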

¹ Subsequently we focus on support and confidence measures. The results can be
transferred to other measures such as interest in a straightforward way.



2.3 Fuzzy Association Rules

Replacing the sets (intervals) A and B in (4) by fuzzy sets (intervals) leads to
fuzzy (quantitative) association rules. Thus, a fuzzy association rule is
understood as a rule of the form A ⇒ B, where A and B are now fuzzy subsets
rather than crisp subsets of the domains \(\mathfrak{D}_X\) and \(\mathfrak{D}_Y\) of
variables X and Y, respectively. We shall use the same notation for ordinary
sets and fuzzy sets. Moreover, we shall not distinguish between a fuzzy set and
its membership function, that is, A(x) denotes the degree of membership of the
element x in the fuzzy set A. Note that an ordinary set A can be considered as a
“degenerate” fuzzy set with membership degrees
\(A(x) = \mathbb{I}_A(x) \in \{0, 1\}\).



3 Quality Measures for Fuzzy Association Rules 



The standard approach to generalizing the quality measures for fuzzy association
rules is to replace set-theoretic by fuzzy set-theoretic operations. The Cartesian
product A × B of two fuzzy sets A and B is usually defined by the membership
function \((x, y) \mapsto \min\{A(x), B(y)\}\). Moreover, the cardinality of a
finite fuzzy set is simply the sum of its membership degrees [17]. Thus, (5) and
(6) can be generalized as follows:

\[ \mathrm{supp}(A \Rightarrow B) = \sum_{(x,y) \in D} \min\{A(x), B(y)\}, \tag{7} \]

\[ \mathrm{conf}(A \Rightarrow B) = \frac{\sum_{(x,y) \in D} \min\{A(x), B(y)\}}{\sum_{(x,y) \in D} A(x)}. \tag{8} \]

Note that the support of A ⇒ B corresponds to the sum of the individual
supports, provided by tuples (x, y) ∈ D:²

\[ \mathrm{supp}_{[x,y]}(A \Rightarrow B) = \min\{A(x), B(y)\}. \tag{9} \]

According to (9), (x, y) supports A ⇒ B if both x ∈ A and y ∈ B.
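The min-based support and confidence can be sketched with exact fractions. The two membership functions are borrowed from the "approximately 10" and "almost 0" example of Section 3.1 as far as it can be read from the text, and the four data points are invented, so both should be treated as assumptions.

```python
from fractions import Fraction

def A(x):
    """Membership in "approximately 10": triangular on 6..14, peak at 10."""
    return 1 - Fraction(abs(10 - x), 5) if 6 <= x <= 14 else Fraction(0)

def B(y):
    """Membership in "almost 0": decreasing on 0..4."""
    return 1 - Fraction(y, 5) if 0 <= y <= 4 else Fraction(0)

def fuzzy_supp(data, a, b):
    """Sum of the individual supports min{A(x), B(y)} over all data points."""
    return sum(min(a(x), b(y)) for x, y in data)

def fuzzy_conf(data, a, b):
    """Fuzzy support divided by the fuzzy cardinality of the antecedent."""
    total_a = sum(a(x) for x, _ in data)
    return fuzzy_supp(data, a, b) / total_a if total_a else Fraction(0)

# Invented integer observations.
data = [(8, 3), (10, 0), (12, 1), (3, 2)]
```

The individual supports are 2/5, 1, 3/5 and 0, so the support is 2; the antecedent memberships sum to 11/5, so the confidence is 10/11. With crisp (0/1) membership functions the same code reduces to the counts in (5) and (6).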



3.1 Support 

The fact that the antecedent A and the consequent B play symmetrical roles in
(9) might appear strange. Indeed, a more logic-oriented approach to modeling a
fuzzy rule “If X is A then Y is B” would use a generalized implication operator
$\rightsquigarrow$, i.e., a mapping $[0, 1] \times [0, 1] \to [0, 1]$ which generalizes the classical material
implication (particularly, $\rightsquigarrow$ is non-increasing in the first and non-decreasing in
the second argument). Thus, individual support can be defined as

$$\mathrm{supp}_{[x,y]}(A \Rightarrow B) = A(x) \rightsquigarrow B(y) \qquad (10)$$

and, hence, the overall (now asymmetric) support as

$$\mathrm{supp}(A \Rightarrow B) = \sum_{(x,y) \in D} A(x) \rightsquigarrow B(y). \qquad (11)$$

^ See [12] for an alternative approach where the frequency of a fuzzy item is measured
by a fuzzy cardinality, i.e. by a fuzzy (rather than by a crisp) number.

In order to realize the difference between (9) and (10), consider a simple rule
“$A \Rightarrow B$: If X is approximately 10 then Y is almost 0” defined by two fuzzy
subsets of the non-negative integers,

$$A : x \mapsto \begin{cases} 1 - \frac{|10 - x|}{5} & \text{if } 6 \le x \le 14 \\ 0 & \text{otherwise} \end{cases}, \qquad B : y \mapsto \begin{cases} 1 - \frac{y}{5} & \text{if } 0 \le y \le 4 \\ 0 & \text{otherwise} \end{cases}$$

To which degree does the tuple (x, y) = (8, 3) support the above rule? According
to (9), the support is 2/5, namely the minimum of the membership of 8 in A and
the membership of 3 in B. According to (10) with the Goguen implication

$$\alpha \rightsquigarrow \beta = \begin{cases} 1 & \text{if } \alpha = 0 \\ \min\{1, \beta/\alpha\} & \text{if } \alpha > 0 \end{cases}$$

the individual support is larger, namely 2/3. In fact, (x, y) = (8, 3) hardly
violates (and hence supports) the rule in the sense of (10): It is true that y = 3
does not fully satisfy the conclusion part of the rule; however, since x = 8 does
not fully meet the condition part either, it is actually not expected to do so.
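The two values quoted for the tuple (8, 3) can be checked with a few lines of Python. This is only a sketch of the example above; the function names are chosen for illustration.

```python
from fractions import Fraction as F

def A(x):
    # "X is approximately 10": 1 - |10 - x|/5 on 6..14, 0 otherwise.
    return 1 - F(abs(10 - x), 5) if 6 <= x <= 14 else F(0)

def B(y):
    # "Y is almost 0": 1 - y/5 on 0..4, 0 otherwise.
    return 1 - F(y, 5) if 0 <= y <= 4 else F(0)

def goguen(a, b):
    # Goguen implication: 1 if a == 0, else min(1, b/a).
    return F(1) if a == 0 else min(F(1), b / a)

x, y = 8, 3
conjunctive = min(A(x), B(y))       # individual support (9)
implicative = goguen(A(x), B(y))    # individual support (10)

print(conjunctive, implicative)  # 2/5 2/3
```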



Fig. 1. A simple functional relation between two variables
that can be described by means of a gradual fuzzy rule.



As can be seen, the definition of adequate quality measures for fuzzy association
rules depends strongly on the interpretation of the rule.^ For further
illustration, consider the nine observations shown in Figure 1.^ In the sense of
(10), each of these observations does fully support the rule “If X is approximately
10 then Y is almost 0.” In fact, this rule is actually interpreted as “The closer X
is to 10, the closer Y is to 0” or, more precisely, “The closer X is to 10, the more
it is guaranteed that Y is close to 0.” Therefore, (11) yields $\mathrm{supp}(A \Rightarrow B) = 9$.
As opposed to this, the overall support is only 5 according to (7), since the
individual support $\min\{A(x), B(y)\}$ which comes from a point (x, y) is bounded by
B(y), the closeness of y to 0. For instance, the support through (6, 4) is only 1/5
rather than 1. Note that, in the sense of (11), the support of the rule $A \Rightarrow B$ is
larger than the support of $B \Rightarrow A$! This reflects the fact that the closeness of x
to 10 is less guaranteed by the closeness of y to 0 than vice versa. For instance,
y = 2 is “rather” close to 0, whereas x = 14 is only “more or less” close to 10,
hence $\mathrm{supp}_{[14,2]}(B \Rightarrow A) = 1/3 < 1 = \mathrm{supp}_{[14,2]}(A \Rightarrow B)$.

^ Particularly, the strategy of first using (7) and (8) to find interesting fuzzy
associations and then interpreting these rules as implications appears questionable [4].

^ Needless to say, this is a somewhat artificial example not at all typical of data mining.
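The following sketch reproduces the numbers quoted above. Since the data of Figure 1 are not recoverable from the text, the nine points below are a hypothetical data set chosen only to be consistent with the stated values (it contains (6, 4) and (14, 2)); the Goguen implication is used as in the preceding example.

```python
from fractions import Fraction as F

def A(x):
    return 1 - F(abs(10 - x), 5) if 6 <= x <= 14 else F(0)

def B(y):
    return 1 - F(y, 5) if 0 <= y <= 4 else F(0)

def goguen(a, b):
    return F(1) if a == 0 else min(F(1), b / a)

# Hypothetical stand-in for the nine observations of Figure 1.
points = [(6, 4), (7, 3), (8, 2), (9, 1), (10, 0),
          (11, 0), (12, 1), (13, 1), (14, 2)]

supp_min = sum(min(A(x), B(y)) for x, y in points)      # (7)
supp_ab = sum(goguen(A(x), B(y)) for x, y in points)    # (11) for A => B
supp_ba = sum(goguen(B(y), A(x)) for x, y in points)    # (11) for B => A

print(supp_min, supp_ab, supp_ba)  # 5 9 443/60
```

With these assumed points the implication-based support of the rule exceeds that of its inversion, whereas the conjunctive measure (7) cannot distinguish the two directions at all.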

The above example shows that the implication-based approach should be 
preferred whenever an association rule is thought of as expressing a gradual “the
more ... the more ...” relation between variables, and all the more if this relation
is not symmetric. For instance, the rule “Young people have low income” might 
actually be understood as “The younger a person, the lower the income,” and 
this rule might well be distinguished from its inversion “The lower the income, 
the younger the person.” 

Concerning the adequacy of (11) as a support measure for association rules,
two points deserve mentioning. Firstly, the concept of support in the context of
association rules is actually intended as non-trivial support. Yet, the truth degree
of a (generalized) implication $\alpha \rightsquigarrow \beta$ is 1 whenever $\alpha = 0$: From a logical point
of view, a false premise entails any conclusion. That is, the rule $A \Rightarrow B$ would
also be supported by those points (x, y) for which $x \notin A$. In order to avoid this
effect, (10) can be modified as follows:

$$\mathrm{supp}_{[x,y]}(A \Rightarrow B) = \begin{cases} A(x) \rightsquigarrow B(y) & \text{if } A(x) > 0 \\ 0 & \text{if } A(x) = 0 \end{cases} \qquad (12)$$



According to (12), a point (x, y) supports a rule if both it satisfies the rule (from
a logical point of view) and it is non-trivial. Here, non-triviality is considered as a
binary concept. However, it can also be quantified as a gradual property, namely
as the degree to which x is in A. Combining satisfaction and non-triviality by
means of a generalized logical conjunction T, a so-called t-norm, then yields
$\mathrm{supp}_{[x,y]}(A \Rightarrow B) = T(A(x), A(x) \rightsquigarrow B(y))$. For example, by using the product
operator as a special t-norm we obtain

$$\mathrm{supp}_{[x,y]}(A \Rightarrow B) = A(x) \cdot (A(x) \rightsquigarrow B(y)). \qquad (13)$$

Note that (9), (12), and (13) are identical in the case where A and B are intervals,
that is where $A(x), B(y) \in \{0, 1\}$.
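The individual support measures (9), (10), (12) and (13) can be compared side by side. In this sketch the Łukasiewicz implication min{1, 1 - a + b} is an assumed example operator, not one prescribed by the text, and the function names are illustrative.

```python
from fractions import Fraction as F
from itertools import product

def lukasiewicz(a, b):
    # An R-implication, used here only as an example operator.
    return min(F(1), 1 - a + b)

def supp9(a, b):
    return min(a, b)

def supp10(a, b):
    return lukasiewicz(a, b)

def supp12(a, b):
    return lukasiewicz(a, b) if a > 0 else F(0)

def supp13(a, b):
    return a * lukasiewicz(a, b)

# A false premise fully "supports" the rule under (10) but not under (12) or (13).
print(supp10(F(0), F(1, 2)), supp12(F(0), F(1, 2)), supp13(F(0), F(1, 2)))  # 1 0 0

# On crisp degrees the conjunctive and both non-trivial measures coincide.
crisp_equal = all(supp9(a, b) == supp12(a, b) == supp13(a, b)
                  for a, b in product([F(0), F(1)], repeat=2))
print(crisp_equal)  # True
```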

The second point concerns the choice of the implication operator $\rightsquigarrow$. In fact,
different types of implication operators exist which support different
interpretations of a fuzzy rule [11]. The gradual “the more ... the more ...” interpretation
discussed above is supported by so-called R(esiduated)-implications. An
implication of this type can be derived from a t-norm T by residuation (hence the
name):

$$\alpha \rightsquigarrow \beta = \sup\{\gamma \in [0, 1] \mid T(\alpha, \gamma) \le \beta\}.$$

A second important class is given by so-called S(trong)-implications, which are
defined by $\alpha \rightsquigarrow \beta = n(\alpha) \oplus \beta$, where $\oplus$ is a t-conorm (a generalized disjunction)
and n a strong (hence the name) negation. For example, taking $n(\cdot) = 1 - (\cdot)$ and
$\oplus = \max$, one obtains the Kleene-Dienes implication $\alpha \rightsquigarrow \beta = \max\{1 - \alpha, \beta\}$.

S-implications support a different type of fuzzy rule, often referred to as
certainty rules. Basically, they attach a level of uncertainty to the conclusion
part of the rule, in correspondence with the truth degree of the condition part.
These rules, however, appear less reasonable in the context of association rules.
This can be exemplified by the Kleene-Dienes implication. For this operator, the
truth degree is lower-bounded by $1 - A(x)$. Thus, combining satisfaction and
non-triviality as above with the minimum as t-norm entails
$\mathrm{supp}_{[x,y]}(A \Rightarrow B) \ge \min\{A(x), 1 - A(x)\}$ regardless of the value B(y); with the
product as in (13), the bound becomes $A(x) \cdot (1 - A(x))$. For example, if $A(x) = 1/2$,
then $\min\{A(x), A(x) \rightsquigarrow B(y)\} = 1/2$, no matter whether B(y) is 1/2, 1/4, or even 0.
This contrasts with R-implications, for which $\beta = 0$ generally implies $\alpha \rightsquigarrow \beta = 0$.
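Residuation and the Kleene-Dienes bound can be illustrated numerically. The sketch below replaces the unit interval by a finite grid (an assumption made so that the supremum becomes a maximum) and uses the minimum and the Łukasiewicz t-norm as example t-norms; the last line evaluates T(A(x), A(x) ⇝ B(y)) with T = min for the Kleene-Dienes implication.

```python
from fractions import Fraction as F

GRID = [F(k, 20) for k in range(21)]  # finite stand-in for [0, 1]

def residuum(tnorm, a, b):
    # R-implication: largest c on the grid with T(a, c) <= b.
    return max(c for c in GRID if tnorm(a, c) <= b)

def t_min(a, c):
    return min(a, c)

def t_luk(a, c):
    return max(F(0), a + c - 1)

def kleene_dienes(a, b):
    # S-implication with n(a) = 1 - a and max as t-conorm.
    return max(1 - a, b)

a, b = F(3, 5), F(2, 5)
print(residuum(t_min, a, b))   # 2/5, the Goedel implication for a > b
print(residuum(t_luk, a, b))   # 4/5, i.e. min(1, 1 - a + b)

# With A(x) = 1/2 the min-based support never drops below 1/2, whatever B(y) is.
bounds = [min(F(1, 2), kleene_dienes(F(1, 2), by))
          for by in (F(1, 2), F(1, 4), F(0))]
print(bounds == [F(1, 2)] * 3)  # True
```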



3.2 Confidence 



A measure of confidence of a fuzzy association rule can be derived from a
corresponding measure of support. Indeed, in the non-fuzzy case, the confidence of
$A \Rightarrow B$ is nothing else than the support of $A \Rightarrow B$ over the support of A, that
is, the support of $A \Rightarrow \mathcal{D}_Y$. Interestingly enough, however, the minimal
confidence condition $\mathrm{conf}(A \Rightarrow B) \ge \Delta$ (where $\Delta$ is a user-specified threshold) can
be interpreted in different ways, which in turn suggest different generalizations.

According to the aforementioned interpretation which relates the support of
$A \Rightarrow B$ to the support of A, one obtains the generalized confidence measure

$$\mathrm{conf}(A \Rightarrow B) = \frac{\sum_{(x,y) \in D} \mathrm{supp}_{[x,y]}(A \Rightarrow B)}{\sum_{(x,y) \in D} \mathrm{supp}_{[x,y]}(A \Rightarrow \mathcal{D}_Y)}. \qquad (14)$$



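Note that $A(x) \rightsquigarrow \mathcal{D}_Y(y) = A(x) \rightsquigarrow 1 = 1$ for all (x, y). Thus, the denominator
in (14) simplifies to $\sum_{(x,y) \in D} \mathbb{I}_{(0,1]}(A(x))$ for (12) and $\sum_{(x,y) \in D} A(x)$ for (13).

A second possibility is to relate the support of $A \Rightarrow B$ to the support of
$A \Rightarrow \neg B$. In this case, the minimal confidence condition means that the rule
$A \Rightarrow B$ should be supported much better than $A \Rightarrow \neg B$:

$$\mathrm{conf}(A \Rightarrow B) = \frac{\sum_{(x,y) \in D} \mathrm{supp}_{[x,y]}(A \Rightarrow B)}{\sum_{(x,y) \in D} \mathrm{supp}_{[x,y]}(A \Rightarrow \neg B)}. \qquad (15)$$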



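Note that $A(x) = \mathrm{supp}_{[x,y]}(A \Rightarrow \mathcal{D}_Y) = \mathrm{supp}_{[x,y]}(A \Rightarrow B) + \mathrm{supp}_{[x,y]}(A \Rightarrow \neg B)$
for all (x, y) in the non-fuzzy case, which means that (14) and (15) are equivalent
in the sense that one criterion can mimic the other one by adapting its threshold:

$$\frac{\mathrm{supp}(A \Rightarrow B)}{\mathrm{supp}(A \Rightarrow \mathcal{D}_Y)} \ge \Delta \iff \frac{\mathrm{supp}(A \Rightarrow B)}{\mathrm{supp}(A \Rightarrow \neg B)} \ge \frac{\Delta}{1 - \Delta}.$$

Indeed, substituting $\mathrm{supp}(A \Rightarrow \mathcal{D}_Y) = \mathrm{supp}(A \Rightarrow B) + \mathrm{supp}(A \Rightarrow \neg B)$ into the
left-hand condition and rearranging gives $(1 - \Delta)\, \mathrm{supp}(A \Rightarrow B) \ge \Delta\, \mathrm{supp}(A \Rightarrow \neg B)$.



4 Semantic Interpretation of Fuzzy Association Rules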

In this section, we propose a semantic model of implication-based fuzzy rules 
which can directly be applied to fuzzy association rules. The idea is to represent 
a fuzzy rule as a collection of crisp (implication-based) rules. According to this 
model, a fuzzy association rule can be considered as a convex combination of 
non-fuzzy association rules. In this connection, we shall also justify the support 
measures (12) and (13). 



4.1 Pure Gradual Rules 



Consider two variables X and Y ranging on domains $\mathcal{D}_X$ and $\mathcal{D}_Y$, respectively.
Moreover, let A and B denote fuzzy subsets of $\mathcal{D}_X$ and $\mathcal{D}_Y$. For the sake of
simplicity, we assume the range of A and B to be a finite subset $\mathcal{L} \subset [0, 1]$. That
is, membership degrees A(x) and B(y) are elements of $\mathcal{L} = \{\lambda_1, \ldots, \lambda_n\}$, where
$0 = \lambda_1 < \lambda_2 < \ldots < \lambda_n = 1$.

A special type of fuzzy rule, called pure gradual rule [10], is obtained for the
Rescher-Gaines implication

$$\alpha \rightsquigarrow \beta = \begin{cases} 1 & \text{if } \alpha \le \beta \\ 0 & \text{if } \alpha > \beta \end{cases} \qquad (16)$$



A pure gradual rule does actually induce a crisp relation of admissible tuples
(x, y). In fact, the fuzzy rule $A \Rightarrow B$, modeled by the implication (16), is
equivalent to the following class of non-fuzzy constraints:

$$X \in A_\lambda \Rightarrow Y \in B_\lambda \quad (\lambda \in \mathcal{L}) \qquad (17)$$

where $A_\lambda = \{x \mid A(x) \ge \lambda\}$ is the $\lambda$-cut of A. Now, in some situations one
might wish to modify the constraints (17), that is to weaken or to strengthen a
conclusion $Y \in B_\lambda$ drawn from the condition $X \in A_\lambda$. This leads to a collection

$$X \in A_\lambda \Rightarrow Y \in B_{m(\lambda)} \quad (\lambda \in \mathcal{L}) \qquad (18)$$

of (non-fuzzy) constraints, where m is a mapping $\mathcal{L} \to \mathcal{L}$. These constraints can
be written compactly in terms of membership functions as $m(A(X)) \le B(Y)$,
and correspond to the rule $A \Rightarrow B$ modeled by the modified Rescher-Gaines
implication $\rightsquigarrow_m$ with $\alpha \rightsquigarrow_m \beta = 1$ if $m(\alpha) \le \beta$ and 0 otherwise. Given two fuzzy
sets A and B, we can thus associate a gradual rule $A \rightsquigarrow_m B$ (which is short for:
$A \Rightarrow B$ modeled by the implication $\rightsquigarrow_m$) with each function $m : \mathcal{L} \to \mathcal{L}$. Note
that m should be non-decreasing: If the premise $X \in A_\lambda$ entails the conclusion
$Y \in B_{m(\lambda)}$, then a more restrictive premise $X \in A_{\lambda'}$ ($\lambda < \lambda'$) justifies this
conclusion all the more, that is $m(\lambda) \le m(\lambda')$. Thus, the scale $\mathcal{L}$ gives rise to
the following class of gradual rules:

$$\mathcal{G} = \mathcal{G}_{A,B} = \{A \rightsquigarrow_m B \mid m : \mathcal{L} \to \mathcal{L} \text{ is non-decreasing}\}. \qquad (19)$$
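For a small scale the class (19) can be enumerated explicitly. The sketch below assumes a hypothetical three-level scale; each non-decreasing map m is represented by the sorted tuple of its images, which is exactly what `combinations_with_replacement` generates.

```python
from fractions import Fraction as F
from itertools import combinations_with_replacement

scale = [F(0), F(1, 2), F(1)]  # hypothetical scale L with n = 3 levels

# A non-decreasing map m: L -> L is a sorted tuple of images (m(l1), ..., m(ln)).
maps = [dict(zip(scale, images))
        for images in combinations_with_replacement(scale, len(scale))]

def holds(m, a, b):
    # Modified Rescher-Gaines implication: true iff m(a) <= b.
    return m[a] <= b

identity = {level: level for level in scale}   # recovers the constraints (17)
print(len(maps), identity in maps)  # 10 True
```

With n levels there are C(2n - 1, n) such maps, so the class stays finite but grows quickly with the resolution of the scale.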
4.2 Other Implication-Based Rules

The tuples (x, y) that satisfy a pure gradual rule $A \rightsquigarrow_m B$ define an ordinary
relation $\pi_m \subseteq \mathcal{D}_X \times \mathcal{D}_Y$:

$$\pi_m(x, y) = \begin{cases} 1 & \text{if } m(A(x)) \le B(y) \\ 0 & \text{if } m(A(x)) > B(y) \end{cases} \qquad (20)$$

More generally, an implication operator $\rightsquigarrow$ induces a fuzzy relation $\pi_{\rightsquigarrow}$, where
$\pi_{\rightsquigarrow}(x, y) = A(x) \rightsquigarrow B(y)$ is the degree of admissibility of (x, y). Subsequently,
we assume a multiple-valued implication $\rightsquigarrow$ to be non-increasing in the first and
non-decreasing in the second argument, and to satisfy the identity property:
$\lambda \rightsquigarrow 1 = 1$ for all $\lambda \in \mathcal{L}$.



4.3 Randomized Gradual Rules 

We are now going to establish a helpful relationship between an implication-based
rule $A \rightsquigarrow B$, i.e. the rule $A \Rightarrow B$ modeled by the implication operator
$\rightsquigarrow$, and the class (19) of pure gradual rules $A \rightsquigarrow_m B$ associated with A and B.

Definition 1 (randomized rule). A randomized rule associated with a conditional
statement “If X is A then Y is B” is a tuple $(\mathcal{G}, p)$, where $\mathcal{G} = \mathcal{G}_{A,B}$ is
the (finite) set of pure gradual rules (19) and p is a probability distribution on
$\mathcal{G}$. Each rule $A \rightsquigarrow_m B$ is identified by the corresponding function $m : \mathcal{L} \to \mathcal{L}$.
Moreover, $p_m = p(A \rightsquigarrow_m B)$ is interpreted as the probability (or, more generally,
the weight) of the rule $A \rightsquigarrow_m B$.

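Recall that each pure gradual rule $A \rightsquigarrow_m B$ induces an admissible set (20) of
tuples (x, y). Therefore, a randomized rule $(\mathcal{G}, p)$ gives rise to a random set over
$\mathcal{D}_X \times \mathcal{D}_Y$ and, hence, induces the following fuzzy relation:

$$\pi_{(\mathcal{G},p)} = \sum_{m \in \mathcal{G}} p_m \cdot \pi_m. \qquad (21)$$

Moreover, (21) is completely determined by the following implication operator
associated with $(\mathcal{G}, p)$:

$$\lambda_i \rightsquigarrow_{(\mathcal{G},p)} \lambda_j = \sum_{m \,:\, m(\lambda_i) \le \lambda_j} p_m. \qquad (22)$$

Namely, $\pi_{(\mathcal{G},p)}(x, y) = A(x) \rightsquigarrow_{(\mathcal{G},p)} B(y)$ for all $(x, y) \in \mathcal{D}_X \times \mathcal{D}_Y$.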
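The operator (22) induced by a randomized rule can be tabulated directly. The sketch below assumes a hypothetical three-level scale and, purely for illustration, uniform weights over the class (19).

```python
from fractions import Fraction as F
from itertools import combinations_with_replacement

scale = [F(0), F(1, 2), F(1)]  # hypothetical scale L

# All non-decreasing maps m: L -> L, each stored as a dict.
maps = [dict(zip(scale, images))
        for images in combinations_with_replacement(scale, len(scale))]

# Assumed weights: the uniform distribution over the class (19).
p = [F(1, len(maps))] * len(maps)

def induced(a, b):
    # Operator (22): total weight of the rules m with m(a) <= b.
    return sum((w for m, w in zip(maps, p) if m[a] <= b), F(0))

table = {(a, b): induced(a, b) for a in scale for b in scale}
print(table[(F(1, 2), F(0))], table[(F(1), F(1, 2))])  # 3/10 2/5
```

The resulting table is non-increasing in its first and non-decreasing in its second argument, and every entry with second argument 1 equals 1, as required of the implications considered here.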

Recall that a pure gradual rule corresponds to a collection of simple,
non-fuzzy constraints and, hence, has a very simple semantics. Since a random
rule is a convex combination of pure gradual rules, it can also be interpreted in
a very simple way. This makes the representation of a general implication-based
fuzzy rule in terms of a random rule appealing. Concerning this representation,
we have proved the following existence and uniqueness results [9]: For
each implication operator $\rightsquigarrow$ a probability p exists such that the rule $A \rightsquigarrow B$ is
equivalent to the randomized rule $(\mathcal{G}, p)$ in the sense that $\pi_{\rightsquigarrow} = \pi_{(\mathcal{G},p)}$. That is,
the rule $A \rightsquigarrow B$ and the randomized rule $(\mathcal{G}, p)$ induce the same admissibility
relation on $\mathcal{D}_X \times \mathcal{D}_Y$. Moreover, the probability p is guaranteed to be unique if the
implication operator $\rightsquigarrow$ does not have a certain (strict) monotonicity property.

Theorem 1. For each fuzzy rule $A \rightsquigarrow B$ formalized by means of an implication
operator $\rightsquigarrow$ an equivalent random rule $(\mathcal{G}, p)$ exists. Moreover, the representation
in terms of (G,p) is unique if the condition < hj < hi)y^{lkj < lu < Iti) 

holds for all $1 \le i < k \le n$ and $1 \le j < l \le n$, where $\gamma_{ij} = \lambda_i \rightsquigarrow \lambda_j$.

4.4 Application to Association Rules 

In Section 2, we have proposed to consider a fuzzy association $A \Rightarrow B$ as an
implication-based (gradual) fuzzy rule. Referring to the above interpretation, a
rule $A \Rightarrow B$ can hence be seen as a convex combination of simple or pure gradual
association rules

$$A \rightsquigarrow_m B \quad (m \in \mathcal{G}), \qquad (23)$$

weighted by the probability degrees $p_m$. Each of these gradual association rules
in turn corresponds to a collection

$$A_\lambda \Rightarrow B_{m(\lambda)} \quad (\lambda \in \mathcal{L}) \qquad (24)$$

of ordinary association rules. In fact, if the level-cuts $A_\lambda$ and $B_{m(\lambda)}$ of the fuzzy
sets A and B are intervals, which holds true for commonly used membership
functions, then (24) reduces to a class of interval-based association rules.

The interpretation as a randomized rule assigns an association rule a concrete 
meaning and might hence be helpful in connection with the acquisition (mining) 
and interpretation of such rules. Apart from this, it provides a basis for justifying 
quality measures for fuzzy association rules. In fact, proceeding from the convex
combination of rules (23) it is natural to define the support of $A \Rightarrow B$ as the
convex combination of the supports of the rules $A \rightsquigarrow_m B$:

$$\mathrm{supp}(A \Rightarrow B) = \sum_{m \in \mathcal{G}} p_m \cdot \mathrm{supp}(A \rightsquigarrow_m B). \qquad (25)$$

Thus, it remains to define the (non-trivial) support of a pure gradual association
rule $A \rightsquigarrow_m B$, that is of a collection of ordinary association rules (24). To which
degree does a point (x, y) support this class of constraints? One possibility is to
say that (x, y) supports $A \rightsquigarrow_m B$ if it satisfies all individual constraints, and,
furthermore, at least one of these constraints is non-trivial:

$$\mathrm{supp}_{[x,y]}(A \rightsquigarrow_m B) = \begin{cases} 1 & \text{if } A(x) > 0 \wedge m(A(x)) \le B(y) \\ 0 & \text{otherwise} \end{cases} \qquad (26)$$

A second possibility is to define $\mathrm{supp}_{[x,y]}(A \rightsquigarrow_m B)$ as the sum of weights of
those individual constraints which are indeed non-trivially satisfied:

$$\mathrm{supp}_{[x,y]}(A \rightsquigarrow_m B) = \begin{cases} A(x) & \text{if } m(A(x)) \le B(y) \\ 0 & \text{otherwise} \end{cases} \qquad (27)$$

It is readily verified that (26), in conjunction with (25), yields the support
measure (12), and that (27) in place of (26) implies (13). This result provides a sound
basis for these measures of (individual) support and, hence, for further quality
measures derived from them.
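The claim that (25) with (26) yields (12), and with (27) yields (13), can be checked exhaustively on a small scale. In this sketch the four-level scale and the random weights are assumptions made for illustration; the implication used in (12) and (13) is the operator (22) induced by those weights.

```python
from fractions import Fraction as F
from itertools import combinations_with_replacement
import random

scale = [F(0), F(1, 3), F(2, 3), F(1)]  # hypothetical scale L
maps = [dict(zip(scale, images))
        for images in combinations_with_replacement(scale, len(scale))]

random.seed(0)
raw = [random.randint(1, 9) for _ in maps]   # arbitrary positive weights
p = [F(r, sum(raw)) for r in raw]            # normalised to a distribution

def induced(a, b):
    # Implication (22) induced by the randomized rule.
    return sum((w for m, w in zip(maps, p) if m[a] <= b), F(0))

def via26(a, b):
    # (25) with the binary individual support (26).
    return sum((w for m, w in zip(maps, p) if a > 0 and m[a] <= b), F(0))

def via27(a, b):
    # (25) with the graded individual support (27).
    return sum((w * a for m, w in zip(maps, p) if m[a] <= b), F(0))

ok12 = all(via26(a, b) == (induced(a, b) if a > 0 else 0)
           for a in scale for b in scale)
ok13 = all(via27(a, b) == a * induced(a, b) for a in scale for b in scale)
print(ok12, ok13)  # True True
```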



5 Concluding Remarks 

We have proposed an implication-based approach to fuzzy association rules as
well as a semantic model which suggests considering such rules as a convex
combination of simple, non-fuzzy association rules. Thus, a fuzzy association
can be seen as a compact representation of a class of simple rules. This model
clarifies the meaning of fuzzy association rules and provides a sound basis for them.

The paper has mainly focused on theoretical foundations of fuzzy association
rules. An important aspect of ongoing research is the practical realization of
the results, that is the development of rule mining procedures. Our current
implementation (not presented here due to space limitations, see [14]) is an
extension of the Apriori algorithm [1] which is able to cope with fuzzy attribute
values and asymmetric support measures. This algorithm takes advantage of the
fact that the support (13) of $A \Rightarrow B$ is upper-bounded by the support of the
premise A: $\mathrm{supp}_{[x,y]}(A \Rightarrow B) = A(x) \cdot (A(x) \rightsquigarrow B(y)) \le A(x)$. Consequently, the
premise A of a minimally supported rule $A \Rightarrow B$ must be a frequent itemset or,
put in a different way, the frequent itemsets (which can be found by Apriori)
constitute a superset of the condition parts of minimally supported association
rules. Furthermore, the algorithm makes use of a monotonicity property for
implications which is similar to the monotonicity property of frequent itemsets
employed by Apriori: $\mathrm{supp}(A \Rightarrow B) \le \mathrm{supp}(A \Rightarrow B')$ for all $B' \subset B$. Thus, if
$A \Rightarrow B$ satisfies the minimum support condition, the same condition holds for
each rule $A \Rightarrow B'$ with $B' \subset B$. This provides the basis for filtering candidate
rules (obtained by combining frequent itemsets A with conclusions B) in an
efficient way.
Abstract. The paper presents a new general measure of rule interestingness.
Many known measures such as chi-square, Gini gain or entropy
gain can be obtained from this measure by setting some numerical pa- 
rameters, including the amount of trust we have in the estimation of the 
probability distribution of the data. Moreover, we show that there is a 
continuum of measures having chi-square, Gini gain and entropy gain 
as boundary cases. Therefore our measure generalizes both conditional 
and unconditional classical measures of interestingness. Properties and 
experimental evaluation of the new measure are also presented. 

Keywords: interestingness measure, distribution, Kullback-Leibler di- 
vergence, Cziszar divergence, rule. 



1 Introduction 

Determining the interestingness of rules is an important data mining problem
since many data mining algorithms produce enormous amounts of rules, making
it difficult for the user to analyze them manually. Thus, it is important to
establish some numerical interestingness measure for rules, which can help users
to sort the discovered rules. A survey of such measures can be found in [?]. Here
we concentrate on measures that assess how much knowledge we gain about the
joint distribution of a set of attributes Q by knowing the joint distribution of
some set of attributes P. Examples of such measures are entropy gain, mutual
information, Gini gain, and $\chi^2$. The rules considered here are different
from classical association rules studied in data mining, since we consider full
joint distributions of both antecedent and consequent, while association rules
consider only the probability of all attributes having some specified value. This
approach has the advantage of applicability to multivalued attributes.

We show that all the above mentioned measures are special cases of a more 
general parametric measure of interestingness, and by varying two parameters, 
a family of measures can be obtained containing several well-known classical 
measures as special cases. 

There is work in the machine learning and information theory literature [?]
on generalizing information-theoretic measures. However, all previous
work is concerned with either unconditional or conditional measures, while this
paper presents a generalization which includes a family of intermediate measures,
between conditional and unconditional ones, and shows a relation between these
measures and the amount of trust we have in the estimate of probabilities from
data. For example, we present a continuum of measures between $\chi^2$ (unconditional
measure) and the Gini gain (conditional measure). We show that the
intermediate measures have many interesting properties which make them useful
for rule evaluation.

Next, we give some essential definitions. 

Definition 1. A probability distribution is a matrix of the form

$$\Delta = \begin{pmatrix} x_1 & \cdots & x_m \\ p_1 & \cdots & p_m \end{pmatrix},$$

where $p_i \ge 0$ for $1 \le i \le m$ and $\sum_{i=1}^m p_i = 1$.

$\Delta$ is a uniform distribution if $p_1 = \cdots = p_m = \frac{1}{m}$. An m-valued uniform
distribution will be denoted by $U_m$.

Let $\tau = (T, H, \rho)$ be a database table, where T is the name of the table, H is
its heading, and $\rho$ is its content. If $A \in H$ is an attribute of $\tau$, the domain of A
in $\tau$ is denoted by dom(A). The projection of a tuple $t \in \rho$ on a set of attributes
$L \subseteq H$ is denoted by t[L]. For more on relational notation and terminology
see [?].



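Definition 2. The distribution of a set of attributes $L = \{A_1, \ldots, A_n\}$ is the
matrix

$$\Delta_{L,\tau} = \begin{pmatrix} l_1 & \cdots & l_r \\ p_1 & \cdots & p_r \end{pmatrix}, \qquad (1)$$

where $r = \prod_{j=1}^n |\mathrm{dom}(A_j)|$, $l_i \in \mathrm{dom}(A_1) \times \cdots \times \mathrm{dom}(A_n)$, and
$p_i = \frac{|\{t \in \rho \mid t[L] = l_i\}|}{|\rho|}$ for $1 \le i \le r$.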

The subscript $\tau$ will be omitted when the table $\tau$ is clear from context.

The Havrda-Charvat $\alpha$-entropy of the attribute set L (see [?]) is defined as:

$$\mathcal{H}_\alpha(L) = \frac{1}{1 - \alpha} \left( \sum_{j=1}^r p_j^\alpha - 1 \right).$$

The limit case, when $\alpha$ tends towards 1, yields the Shannon entropy
$\mathcal{H}(L) = -\sum_{j=1}^r p_j \log p_j$. Another important case, the Gini index, is obtained
when $\alpha = 2$ (see [?]) and is given by $\mathrm{gini}(L) = 1 - \sum_{j=1}^r p_j^2$.
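The two special cases of the Havrda-Charvat entropy can be checked numerically. This sketch assumes the form $\frac{1}{1-\alpha}(\sum_j p_j^\alpha - 1)$ and the natural logarithm for the Shannon entropy; the distribution is made up for illustration.

```python
import math

def hc_entropy(p, alpha):
    # Havrda-Charvat entropy: (sum_j p_j**alpha - 1) / (1 - alpha), alpha != 1.
    return (sum(q ** alpha for q in p) - 1) / (1 - alpha)

def shannon(p):
    # Natural-log Shannon entropy, with 0 * log 0 taken as 0.
    return -sum(q * math.log(q) for q in p if q > 0)

def gini(p):
    return 1 - sum(q * q for q in p)

p = [0.5, 0.25, 0.25]  # hypothetical distribution

print(hc_entropy(p, 2), gini(p))                    # 0.625 0.625
print(abs(hc_entropy(p, 1 + 1e-6) - shannon(p)))    # small: the alpha -> 1 limit
```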

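If L, K are two sets of attributes of a table $\tau$ that have the distributions

$$\Delta_L = \begin{pmatrix} l_1 & \cdots & l_m \\ p_1 & \cdots & p_m \end{pmatrix}, \text{ and } \Delta_K = \begin{pmatrix} k_1 & \cdots & k_n \\ q_1 & \cdots & q_n \end{pmatrix},$$

then the conditional Shannon entropy of L conditioned upon K is given by

$$\mathcal{H}(L|K) = -\sum_{i=1}^m \sum_{j=1}^n p_{ij} \log \frac{p_{ij}}{q_j},$$

where $p_{ij} = \frac{|\{t \in \rho \mid t[L] = l_i \text{ and } t[K] = k_j\}|}{|\rho|}$ for $1 \le i \le m$ and $1 \le j \le n$. Similarly,
the Gini conditional index of these distributions is:

$$\mathrm{gini}(L|K) = 1 - \sum_{i=1}^m \sum_{j=1}^n \frac{p_{ij}^2}{q_j}.$$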

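These definitions allow us to introduce the Gini gain and the Shannon gain (called
entropy gain in the literature [?]), defined as:

$$\mathrm{gain}_{gini}(L, K) = \mathrm{gini}(L) - \mathrm{gini}(L|K),$$

$$\mathrm{gain}_{shannon}(L, K) = \mathcal{H}(L) - \mathcal{H}(L|K) = \mathcal{H}(L) + \mathcal{H}(K) - \mathcal{H}(LK), \qquad (2)$$

respectively, where LK is an abbreviation for $L \cup K$. Note that the Shannon
gain is identical to the mutual information between the attribute sets L and K [?].
For the Gini gain we can write:

$$\mathrm{gain}_{gini}(L, K) = \sum_{i=1}^m \sum_{j=1}^n \frac{p_{ij}^2}{q_j} - \sum_{i=1}^m p_i^2. \qquad (3)$$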
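Both gains can be computed from a joint distribution in a few lines. The sketch below uses a hypothetical 2 x 2 joint distribution and natural logarithms, and checks that the two expressions in (2) agree and that (3) matches the definition of the Gini gain.

```python
import math

# Hypothetical joint distribution p[i][j] of (L = l_i, K = k_j).
p = [[0.3, 0.1],
     [0.1, 0.5]]
cells = [(i, j) for i in range(2) for j in range(2)]
pl = [sum(row) for row in p]          # marginal distribution of L
qk = [sum(col) for col in zip(*p)]    # marginal distribution of K

def H(dist):
    return -sum(x * math.log(x) for x in dist if x > 0)

h_cond = -sum(p[i][j] * math.log(p[i][j] / qk[j]) for i, j in cells if p[i][j] > 0)
gain_shannon = H(pl) - h_cond
mutual_info = H(pl) + H(qk) - H([p[i][j] for i, j in cells])

gini_cond = 1 - sum(p[i][j] ** 2 / qk[j] for i, j in cells)
gain_gini = (1 - sum(x * x for x in pl)) - gini_cond
eq3 = sum(p[i][j] ** 2 / qk[j] for i, j in cells) - sum(x * x for x in pl)

print(math.isclose(gain_shannon, mutual_info), math.isclose(gain_gini, eq3))  # True True
```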

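The product of the distributions $\Delta_P$, $\Delta_Q$, where

$$\Delta_P = \begin{pmatrix} x_1 & \cdots & x_m \\ p_1 & \cdots & p_m \end{pmatrix} \text{ and } \Delta_Q = \begin{pmatrix} y_1 & \cdots & y_n \\ q_1 & \cdots & q_n \end{pmatrix},$$

is the distribution

$$\Delta_P \times \Delta_Q = \begin{pmatrix} (x_1, y_1) & \cdots & (x_m, y_n) \\ p_1 q_1 & \cdots & p_m q_n \end{pmatrix}.$$

The attribute sets P, Q are independent if $\Delta_{PQ} = \Delta_P \times \Delta_Q$.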

Definition 3. A rule is a pair of attribute sets (P, Q). If $P, Q \subseteq H$, where
$\tau = (T, H, \rho)$ is a table, then we refer to (P, Q) as a rule of $\tau$.

If (P, Q) is a rule, then we refer to P as the antecedent and to Q as the
consequent of the rule. A rule (P, Q) will be denoted, following the prevalent
convention in the literature, by $P \to Q$.

This broader definition of rules originates in [?], where rules were replaced by
dependencies in order to capture statistical dependence in both the presence and
absence of items in itemsets. The significance of this dependence was measured
by the $\chi^2$ test, and our approach is a further extension of that point of view.
The notion of distribution divergence is central to the rest of the paper.

Definition 4. Let $\mathcal{D}$ be the class of distributions. A distribution divergence is
a function $D : \mathcal{D} \times \mathcal{D} \to \mathbb{R}$ such that:

1. $D(\Delta, \Delta') \ge 0$ and $D(\Delta, \Delta') = 0$ if and only if $\Delta = \Delta'$ for every $\Delta, \Delta' \in \mathcal{D}$.

2. When $\Delta'$ is fixed, $D(\Delta, \Delta')$ is a convex function of $\Delta$; in other words, if
$\Delta = a_1 \Delta_1 + \cdots + a_k \Delta_k$, where $a_i \ge 0$ and $a_1 + \ldots + a_k = 1$, then

$$D(\Delta, \Delta') \le \sum_{i=1}^k a_i D(\Delta_i, \Delta').$$

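An important class of distribution divergences was obtained by Cziszar in [?]
as: $D_\phi(\Delta, \Delta') = \sum_{i=1}^n q_i \, \phi\!\left(\frac{p_i}{q_i}\right)$, where

$$\Delta = \begin{pmatrix} x_1 & \cdots & x_n \\ p_1 & \cdots & p_n \end{pmatrix}, \text{ and } \Delta' = \begin{pmatrix} x_1 & \cdots & x_n \\ q_1 & \cdots & q_n \end{pmatrix}$$

are two distributions and $\phi : \mathbb{R} \to \mathbb{R}$ is a twice differentiable convex function
such that $\phi(1) = 0$. We will also make an additional assumption that $0 \cdot \phi(\frac{0}{0}) = 0$
to handle the case when for some i both $p_i$ and $q_i$ are zero. If for some i, $p_i > 0$
and $q_i = 0$, the value of $D_\phi(\Delta, \Delta')$ is undefined.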

The Cziszar divergence satisfies properties (1) and (2) given above (see [?]).
The following result shows the invariance of Cziszar divergence with respect
to distribution product:

Theorem 1. For any distributions $\Gamma$, $\Delta$, $\Delta'$ and any Cziszar divergence measure
$D_\phi$ we have $D_\phi(\Gamma \times \Delta, \Gamma \times \Delta') = D_\phi(\Delta, \Delta')$.

Depending on the choice of the function $\phi$ we obtain the divergences shown
in the table below:

$\phi(x)$        $D(\Delta, \Delta')$                              Divergence
$x \log x$       $\sum_{i=1}^n p_i \log \frac{p_i}{q_i}$           Kullback-Leibler
$x^2 - x$        $\sum_{i=1}^n \frac{p_i^2}{q_i} - 1$              $\chi^2$





Both the Kullback-Leibler divergence (also known as cross entropy), which we
will denote by $D_{KL}$, and the $\chi^2$-divergence, denoted by $D_{\chi^2}$, are special cases of
the Havrda-Charvat divergence generated by $\phi(x) = \frac{x^\alpha - x}{\alpha - 1}$; specifically,
$D_{\chi^2}$ is obtained by taking $\alpha = 2$, while $D_{KL}$ is obtained as a limit case, when $\alpha$
tends towards 1.
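A generic Cziszar divergence is easy to sketch once the generator is passed in as a function. The code below assumes the form $D_\phi(\Delta, \Delta') = \sum_i q_i \phi(p_i / q_i)$ and the generator $(x^\alpha - x)/(\alpha - 1)$ for the Havrda-Charvat case; the distributions are made up, and the last check illustrates the invariance under distribution product.

```python
import math

def csiszar(phi, p, q):
    # D_phi = sum_i q_i * phi(p_i / q_i); terms with p_i = q_i = 0 contribute 0.
    total = 0.0
    for pi, qi in zip(p, q):
        if pi == 0 and qi == 0:
            continue
        if qi == 0:
            raise ValueError("undefined: p_i > 0 but q_i = 0")
        total += qi * phi(pi / qi)
    return total

def phi_kl(x):
    return x * math.log(x) if x > 0 else 0.0

def phi_chi2(x):
    return x * x - x

def phi_hc(alpha):
    # Assumed Havrda-Charvat generator, alpha != 1.
    return lambda x: (x ** alpha - x) / (alpha - 1)

def product(g, d):
    # Probabilities of the product distribution, in row-major order.
    return [gi * di for gi in g for di in d]

p, q, g = [0.2, 0.3, 0.5], [0.4, 0.4, 0.2], [0.7, 0.3]  # hypothetical
kl = csiszar(phi_kl, p, q)
chi2 = csiszar(phi_chi2, p, q)

print(math.isclose(chi2, sum(a * a / b for a, b in zip(p, q)) - 1))            # True
print(math.isclose(csiszar(phi_hc(2), p, q), chi2))                            # True
print(math.isclose(csiszar(phi_kl, product(g, p), product(g, q)), kl))         # True
```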

It is easy to verify that 

Note that $|\rho| D_{\chi^2}$ equals the $\chi^2$ dependency measure, well known from statistics.

2 Interestingness of Rules

The main goal of this paper is to present a unified approach to the notion of
interestingness of rules. Let $r = P \to Q$ be a rule in a table $\tau = (T, H, \rho)$. To
construct an interestingness measure we will use a Bayesian approach, in that
we will consider an a posteriori distribution $\Theta$ of the consequent set of attributes Q.


The definition of an interestingness measure of r will be guided by two main 
considerations: 

— The more the observed joint distribution of PQ diverges from the product 
distribution of P and the a posteriori distribution $\Theta$ of Q, the more interesting
the rule is. Note that $\Delta_{PQ} = \Delta_P \times \Theta$ corresponds to the situation when P
and Q are independent and the observed distribution of Q is identical to the
a posteriori distribution.

— The rule is not interesting if P, Q are independent. Therefore, we need to 
consider a correcting term in the definition of an interestingness measure that 
will decrease its value when $\Delta_Q$ is different from the a posteriori distribution.

The choice of the distribution $\Theta$ of the consequent Q of rules of the form
$P \to Q$ can be made starting either from the content of the table, that is,
adopting $\Delta_Q$ for $\Theta$, or from some exterior information. For example, if Q is the
sex attribute for a table that contains data concerning some experiment subjects,
we can adopt as the a posteriori distribution either

$$\Theta = \begin{pmatrix} \text{'F'} & \text{'M'} \\ 0.45 & 0.55 \end{pmatrix},$$

assuming that 45% of the individuals involved are female, or a distribution
consistent with the general distribution of the sexes in the general population.

Moreover, we can use the Laplace estimator [?] (also known in the literature
as the m-estimate of probability) to obtain the a posteriori distribution

$$\Theta = \frac{|\rho| \Delta_Q + m \Theta_0}{|\rho| + m},$$

where $\Delta_Q$ is the distribution of Q that is extracted from a table $\tau$, $\Theta_0$ is the
a priori distribution, $|\rho|$ is the size of the database, and the integer m represents
the amount of trust we have in the prior distribution $\Theta_0$. If m = 0, this means
we completely ignore the a priori distribution, and $m \to \infty$ means that we have
no trust in the data, and totally rely on the prior distribution. To avoid using
limits, we denote $a = \frac{|\rho|}{|\rho| + m}$ and write the Laplacian as a convex combination
of the two distributions:

$$\Theta_a = a \Delta_Q + (1 - a) \Theta_0.$$

Now a = 1 and a = 0 correspond to the cases m = 0 and $m \to \infty$, respectively.
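The equivalence of the m-estimate and its convex-combination form can be checked on a small example. In this sketch the observed distribution follows the 0.45/0.55 example above, while the prior, the table size and the value of m are assumptions chosen for illustration.

```python
from fractions import Fraction as F

def laplace(delta_q, theta0, size, m):
    # m-estimate: (|rho| * Delta_Q + m * Theta_0) / (|rho| + m), componentwise.
    return [(size * d + m * t) / (size + m) for d, t in zip(delta_q, theta0)]

def mix(delta_q, theta0, a):
    # Convex-combination form with a = |rho| / (|rho| + m).
    return [a * d + (1 - a) * t for d, t in zip(delta_q, theta0)]

delta_q = [F(45, 100), F(55, 100)]   # observed distribution of Q
theta0 = [F(1, 2), F(1, 2)]          # hypothetical prior
size, m = 200, 50                    # hypothetical |rho| and trust parameter

a = F(size, size + m)
print(laplace(delta_q, theta0, size, m) == mix(delta_q, theta0, a))  # True
print(mix(delta_q, theta0, 1) == delta_q, mix(delta_q, theta0, 0) == theta0)  # True True
```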
Definition 5. Let $r : P \to Q$ be a rule, D be some measure of divergence
between distributions, and let $\Theta$ be a distribution.

The measure of interestingness generated by D and $\Theta$ is defined by

$$\Upsilon_{D,\Theta}(r) = D(\Delta_{PQ}, \Delta_P \times \Theta) - D(\Delta_Q, \Theta).$$

In the above definition $\Theta$ represents the a posteriori distribution of Q, while $\Delta_Q$
is the distribution of Q observed from the data. The term $D(\Delta_Q, \Theta)$ measures the
degree to which $\Delta_Q$ diverges from the prior distribution $\Theta$, and $D(\Delta_{PQ}, \Delta_P \times \Theta)$
measures how far $\Delta_{PQ}$ diverges from the joint distribution of P and Q in case
they were independent, and Q was distributed according to $\Theta$.

The justification for the correcting term $D(\Delta_Q, \Theta)$ is given in the following
theorem:

Theorem 2. If P and Q are independent, and D is a Cziszar measure of
divergence then $\Upsilon_{D,\Theta}(P \to Q) = 0$.

Observe that if D is a Cziszar divergence $D = D_\phi$, then the invariance of
these divergences implies:

$$\Upsilon_{D_\phi,\Theta}(P \to Q) = D_\phi(\Delta_{PQ}, \Delta_P \times \Theta) - D_\phi(\Delta_P \times \Delta_Q, \Delta_P \times \Theta).$$
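The definition and Theorem 2 can be illustrated with a short sketch. The helper names and the distributions are assumptions; the divergence is taken to be of the form $\sum_i q_i \phi(p_i / q_i)$ with strictly positive reference probabilities, and the joint distribution is stored as a matrix whose rows correspond to values of P.

```python
import math

def csiszar(phi, p, q):
    # Assumes every q_i > 0, so no undefined terms arise.
    return sum(qi * phi(pi / qi) for pi, qi in zip(p, q))

def phi_kl(x):
    return x * math.log(x) if x > 0 else 0.0

def phi_chi2(x):
    return x * x - x

def interestingness(phi, joint, theta):
    # D(Delta_PQ, Delta_P x Theta) - D(Delta_Q, Theta)
    dp = [sum(row) for row in joint]
    dq = [sum(col) for col in zip(*joint)]
    flat = [x for row in joint for x in row]
    ref = [pi * tj for pi in dp for tj in theta]
    return csiszar(phi, flat, ref) - csiszar(phi, dq, theta)

theta = [0.5, 0.25, 0.25]                                    # hypothetical Theta
independent = [[pi * qj for qj in (0.2, 0.5, 0.3)] for pi in (0.3, 0.7)]
dependent = [[0.25, 0.05, 0.0], [0.05, 0.35, 0.3]]

for phi in (phi_kl, phi_chi2):
    print(abs(interestingness(phi, independent, theta)) < 1e-12,
          interestingness(phi, dependent, theta) > 0)        # True True
```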

3 Properties of the General Measure of Interestingness 

Initially, we discuss several basic properties of the proposed measure. 

Theorem 3. If D is a Cziszar divergence, then

$$\Upsilon_{D,\Delta_Q}(P \to Q) = \Upsilon_{D,\Delta_P}(Q \to P).$$

The above property means that when the a posteriori distribution of the
consequent is always assumed equal to the distribution observed from data, then
the measure is symmetric with respect to the direction of the rule, i.e. exchanging
the antecedent and consequent does not change the value of the interestingness.

Theorem 4. Let D be a Cziszar divergence. If R is a set of attributes
independent of P, and jointly of PQ, then, for any $\Theta$ we have
$\Upsilon_{D,\Theta}(RP \to Q) = \Upsilon_{D,\Theta}(P \to Q)$.

If R is a set of attributes independent of Q, and jointly of PQ, then
$\Upsilon_{D,\Delta_{RQ}}(P \to RQ) = \Upsilon_{D,\Delta_Q}(P \to Q)$.

The previous result gives a desirable property of $\Upsilon$, since adding independent
attributes should not affect a rule's interestingness. In particular, when $\Theta$
equals the observed distribution of the consequent, then $\Upsilon$ is not affected by
adding independent attributes to either the antecedent or the consequent.

Next, we consider several important special cases of the interestingness mea- 
sure. 

If the divergence D and the a priori distribution used in the definition of
the interestingness measure are chosen appropriately, then the interestingness
$\Upsilon_{D,\Theta}(P \to Q)$ is proportional to a gain of the set of attributes of the consequent
Q of the rule relative to the antecedent P. Both the Gini gain, $\mathrm{gain}_{gini}(Q, P)$,
and the entropy gain, $\mathrm{gain}_{shannon}(Q, P)$, can be obtained by appropriate choice
of D. Moreover a measure proportional to the $\chi^2$ statistic can be obtained in
that way.

Suppose that the attribute sets P, Q have the distributions

$$\Delta_P = \begin{pmatrix} x_1 & \cdots & x_m \\ p_1 & \cdots & p_m \end{pmatrix}, \text{ and } \Delta_Q = \begin{pmatrix} y_1 & \cdots & y_n \\ q_1 & \cdots & q_n \end{pmatrix}.$$

Let $\rho_{ij} = \{t \in \rho \mid t[P] = x_i \text{ and } t[Q] = y_j\}$ and let $p_{ij} = \frac{|\rho_{ij}|}{|\rho|}$ for $1 \le i \le m$ and
$1 \le j \le n$.



Theorem 5. Let $P \to Q$ be a rule in the table $\tau = (T, H, \rho)$. If $D = D_{KL}$ then

$$\Upsilon_{D,\Theta}(P \to Q) = \mathrm{gain}_{shannon}(Q, P),$$

regardless of the choice of $\Theta$.

The above theorem means that for the case of $D_{KL}$ the family of measures
generated by $\Theta$ reduces to a single measure: the Shannon gain (mutual information).
This is not the case for other divergences.

Theorem 6. Let $P \to Q$ be a rule in the table $\tau = (T, H, \rho)$. If $D = D_{\chi^2}$ and
$\Theta = U_n$, where $n = |\mathrm{dom}(Q)|$, then

$$\Upsilon_{D,\Theta}(P \to Q) = n \cdot \mathrm{gain}_{gini}(Q, P).$$



Theorem 7. $\Upsilon_{D_{\chi^2},\Delta_Q}(P \to Q)$ is proportional to $\chi^2(P, Q)$, the
chi-squared statistic [?] for attribute sets P, Q.

Note that above we treat attribute sets $P = \{A_1, \ldots, A_r\}$ and $Q =
\{B_1, \ldots, B_s\}$ as single attributes with the domains given by (1). This is
appropriate, since we are interested in how one set of attributes P influences another set of
attributes Q. Another way, used in [?], is to compute $\chi^2(A_1, \ldots, A_r, B_1, \ldots, B_s)$;
however, this is not what we want.
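Theorems 5 and 7 can be checked on a small contingency table. The counts below are hypothetical, the helper names are illustrative, and natural logarithms are assumed; the proportionality factor in Theorem 7 is taken to be the table size $|\rho|$.

```python
import math

counts = [[30, 10], [10, 50]]          # hypothetical contingency table (rows: P, columns: Q)
N = sum(map(sum, counts))
joint = [[c / N for c in row] for row in counts]
dp = [sum(row) for row in joint]
dq = [sum(col) for col in zip(*joint)]
flat = [x for row in joint for x in row]

def kl(p, q):
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def chi2_div(p, q):
    return sum(pi * pi / qi for pi, qi in zip(p, q)) - 1

def interestingness(div, theta):
    ref = [pi * tj for pi in dp for tj in theta]
    return div(flat, ref) - div(dq, theta)

# Theorem 5: with the Kullback-Leibler divergence the choice of Theta is irrelevant.
mutual_info = kl(flat, [pi * qj for pi in dp for qj in dq])
kl_values = [interestingness(kl, theta) for theta in ([0.5, 0.5], [0.9, 0.1], dq)]
print(all(math.isclose(v, mutual_info) for v in kl_values))   # True

# Theorem 7: |rho| times the chi-square-based measure is Pearson's statistic.
pearson = sum((counts[i][j] - N * dp[i] * dq[j]) ** 2 / (N * dp[i] * dq[j])
              for i in range(2) for j in range(2))
print(math.isclose(N * interestingness(chi2_div, dq), pearson))  # True
```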

The case when $D = D_{\chi^2}$ is of practical interest since it includes two widely
used measures ($\chi^2$ and $\mathrm{gain}_{gini}$) as special cases, and allows for obtaining a
continuum of measures “in between” the two.

Theorem 8 stated below shows that the generalized measure of interestingness
$\Upsilon_{D,\Theta}(P \to Q)$ is minimal when P and Q are independent and thus, it justifies
our definition of this measure through variational considerations.

Theorem 8. Let Tp, q be the measure of interestingness generated by the a pos- 
teriori distribution 0 and the Kullback-Leibler divergence, or the -divergence 
and let P ^ Q be a rule. For any fixed attribute distribution Ap, Aq and a fixed 
distribution 0, the value ofTp e{P — >■ Q) is minimal (and equal to 0) if only if 
ApQ = Ap X Aq, i.e., when P and Q are independent. 
We saw that $\mathrm{gain}_{shannon}$ and $\mathrm{gain}_{gini}$ are equivalent to $\Upsilon_{D_{KL},U_n}$ and $\Upsilon_{D_{\chi^2},U_n}$,
respectively. It is thus natural to define a notion of gain for any divergence $D$ as

$$\mathrm{gain}_D(P \to Q) = \Upsilon_{D,U_n}(P \to Q).$$



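Let $\Delta_{Q|x_i}$ denote the probability distribution of $Q$ conditioned on $P = x_i$. For
any Csiszár measure $D_\varphi$ we have:

$$\mathrm{gain}_{D_\varphi}(P \to Q) = D_\varphi(\Delta_{PQ}, \Delta_P \times U_n) - D_\varphi(\Delta_Q, U_n) = \sum_{i=1}^{m} p_i \, D_\varphi(\Delta_{Q|x_i}, U_n) - D_\varphi(\Delta_Q, U_n).$$

As special cases, $\mathrm{gain}_{gini} = \mathrm{gain}_{D_{\chi^2}}$ (up to the constant factor $n$ of Theorem 6) and $\mathrm{gain}_{shannon} = \mathrm{gain}_{D_{KL}}$.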

A parameterized version of T that takes into account the degree of confidence 
in the distribution of the consequent as it results from the data is introduced 
next. 

Let us define the probability distribution $\Theta_a$, $a \in [0, 1]$, by

$$\Theta_a = a\Delta_Q + (1 - a)U_n.$$

The value of $a$ expresses the amount of confidence we have in $\Delta_Q$ estimated from
the data. The value $a = 1$ means total confidence: we take the probability
estimated from the data as the true probability distribution of $Q$. On the other
hand, $a = 0$ means that we have no confidence in the estimate and use some
prior distribution of $Q$ instead. In our case, the prior is the uniform distribution
$U_n$. Note that $\Theta_1 = \Delta_Q$ and $\Theta_0 = U_n$.
We can now define

$$\Upsilon_{D,a} = \Upsilon_{D,\Theta_a}.$$

Note that when $D = D_{\chi^2}$, we have (up to a constant factor) both $\chi^2(P \to Q)$
and $\mathrm{gain}_{gini}(P \to Q)$ as special cases of $\Upsilon_{D_{\chi^2},a}$. Moreover, by taking different
values of the parameter $a$ we can obtain a continuum of measures in between the
two.

As noted before, both the $D_{KL}$ and $D_{\chi^2}$ divergence measures are special cases of the
Havrda-Charvát divergence, for $\alpha \to 1$ and $\alpha = 2$ respectively. We can thus
introduce $\Upsilon_{\alpha,a} = \Upsilon_{D_{\mathcal{H}_\alpha},\Theta_a}$, which allows us to obtain a family of interestingness
measures, including (up to a constant factor) all three measures discussed above
as special cases, by simply changing two real-valued parameters $\alpha$ and $a$.

Also note that for $a = 0$ we obtain a family of gains (as defined above)
for all the Havrda-Charvát divergences.
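
The parameterized measure for the $\chi^2$ case can be illustrated with a short numerical sketch. It assumes the usual form of the $\chi^2$-divergence, $D_{\chi^2}(p, q) = \sum_k (p_k - q_k)^2 / q_k$, and a joint distribution given as a matrix whose rows index the values of $P$ and whose columns index the values of $Q$. The function names are illustrative only and are not taken from the paper.

```python
import numpy as np

def chi2_divergence(p, q):
    """D_chi2(p, q) = sum_k (p_k - q_k)^2 / q_k, assuming every q_k > 0."""
    p = np.asarray(p, dtype=float).ravel()
    q = np.asarray(q, dtype=float).ravel()
    return float(np.sum((p - q) ** 2 / q))

def upsilon_chi2(joint, a):
    """Upsilon_{D_chi2, a}(P -> Q) for a joint table (rows: P, columns: Q)."""
    joint = np.asarray(joint, dtype=float)
    delta_p = joint.sum(axis=1)          # marginal distribution of P
    delta_q = joint.sum(axis=0)          # marginal distribution of Q
    n = joint.shape[1]
    # Theta_a = a * Delta_Q + (1 - a) * U_n
    theta = a * delta_q + (1.0 - a) * np.full(n, 1.0 / n)
    return (chi2_divergence(joint, np.outer(delta_p, theta))
            - chi2_divergence(delta_q, theta))

def gini_gain(joint):
    """Gini gain of Q relative to P; assumes every value of P has mass > 0."""
    joint = np.asarray(joint, dtype=float)
    delta_p = joint.sum(axis=1)
    delta_q = joint.sum(axis=0)
    cond = joint / delta_p[:, None]      # rows: distribution of Q given P = x_i
    gini_q = 1.0 - np.sum(delta_q ** 2)
    gini_cond = 1.0 - np.sum(cond ** 2, axis=1)
    return float(gini_q - np.sum(delta_p * gini_cond))
```

Under these assumptions, `upsilon_chi2(joint, 0.0)` agrees with `n * gini_gain(joint)` as in Theorem 6, `upsilon_chi2(joint, 1.0)` equals the $\chi^2$ statistic divided by the number of records, and any product distribution yields 0 for every `a`.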
4 Experimental Results

We evaluated the new measure on a simple synthetic dataset and on data from
the UCI machine learning repository [2]. We concentrated on the case $D = D_{\chi^2}$,
as potentially most useful in practice, and computed the interestingness of rules for
different values of the parameter $a$ (defined above).

To ensure that measures throughout the family handle obvious cases correctly,
and to make it easy to observe properties of the measure for different values of
the parameter $a$, we first evaluated the rules on a synthetic dataset with 3 attributes
$A$, $B$, $C$ and with known probabilistic dependencies between them.

Values of attributes $A$ and $B$ have been generated from known probability
distributions:

$$\Delta_A = \begin{pmatrix} 0 & 1 & 2 \\ 0.1 & 0.5 & 0.4 \end{pmatrix}, \qquad \Delta_B = \begin{pmatrix} 0 & 1 \\ 0.2 & 0.8 \end{pmatrix}.$$

Attribute $C$ depends on attribute $A$. Denote by $\Delta_{C|i}$ the distribution of $C$
conditioned upon $A = i$. We used




One million data points have been generated according to this distribution.
For a few values of $a$ we sorted all possible rules based on their $\Upsilon_{D_{\chi^2},a}$
interestingness values. Results are given in Table 1.



1. Attribute B is totally independent of both A and C, so any rule containing
only B as the antecedent or consequent should have interestingness 0. The
experiments confirm this: for all values of the parameter $a$ such rules have
interestingness close to zero, significantly lower than the interestingness of any
other rule.

2. For $a = 0$ (the first quarter of the table) $\Upsilon$ becomes the Gini gain, a measure
that is strongly asymmetric (and could thus suggest the direction of the
dependence) and strongly affected by adding extra independent attributes
to the consequent (which is undesirable).

3. For $a = 1$ (the last quarter of the table) $\Upsilon$ becomes (up to a constant
factor) the $\chi^2$ measure of dependence. This measure is totally symmetric
and not affected by the presence of independent attributes in either antecedent
or consequent. Indeed, it can be seen that all rules involving A and C have
the same interestingness regardless of the presence of B in the antecedent or
consequent.

4. As a varies from 0 to 1 the intermediate measures can be seen to become 
more and more symmetric. Measures for a being close to but less than 1 
could be of practical interest since they seem to ‘combine the best of the 
two worlds’, that is, are still asymmetric and pretty insensitive to presence 
of independent attributes in the consequent. E.g. for a = 0.9 all rules having 
A in the antecedent and C in the consequent have interestingness close to 
Table 1. Rules on synthetic data ordered by $\Upsilon_{D_{\chi^2},a}$ for different values of $a$



rule     $\Upsilon_{D_{\chi^2},0}$      rule     $\Upsilon_{D_{\chi^2},0.5}$
A→BC     0.122061        A→BC     0.0989161
C→AB     0.0896776       AB→C     0.0898611
AB→C     0.0896287       A→C      0.089861
A→C      0.0896287       C→AB     0.0769886
BC→A     0.065851        BC→A     0.0683164
C→A      0.0658484       C→A      0.0683142
B→AC     3.16585e-06     B→AC     2.50502e-06
B→A      2.7369e-06      B→A      2.35091e-06
AC→B     1.37659e-06     AC→B     1.51849e-06
A→B      1.32828e-06     A→B      1.46355e-06
B→C      1.70346e-07     B→C      1.72781e-07
C→B      1.10069e-07     C→B      1.22814e-07

rule     $\Upsilon_{D_{\chi^2},0.9}$    rule     $\Upsilon_{D_{\chi^2},1}$
A→BC     0.0908769       BC→A     0.0905673
AB→C     0.0903859       A→BC     0.0905673
A→C      0.0903859       C→AB     0.0905654
C→AB     0.0834734       AB→C     0.0905654
BC→A     0.082009        A→C      0.0905653
C→A      0.082007        C→A      0.0905653
B→AC     2.19739e-06     AC→B     2.15872e-06
B→A      2.12646e-06     B→AC     2.15872e-06
AC→B     1.95101e-06     A→B      2.08117e-06
A→B      1.87986e-06     B→A      2.08017e-06
B→C      1.73782e-07     C→B      1.74126e-07
C→B      1.57306e-07     B→C      1.74126e-07



0.09, while rules having C in the antecedent and A in the consequent all have
interestingness close to 0.082, regardless of the presence or absence of B.
So for $a = 0.9$ the intermediate measure correctly ranked
the rules, indicating the true direction of the relationship.

We then repeated the above experiment on data from the UCI machine learning
repository [2]. Here we present results for the agaricus-lepiota database,
containing data on North American mushrooms. To keep the ruleset size manageable,
we restrict ourselves to rules involving the class attribute, which indicates whether
the mushroom is edible or poisonous.

In the experiment we enumerated all rules involving up to 3 attributes and
ranked them by interestingness for different values of the parameter $a$. The top ten rules
for each value of $a$ are shown in Table 2. For $a = 1$ the symmetric rules were
removed.

We noticed that for any value of a most of the rules involve the odor at- 
tribute. Indeed the inspection of data revealed that knowing the mushroom’s 
odor allows for identifying its class with 98.5% accuracy, far better than for any 
other attribute. 
Table 2. Rules on mushroom dataset ordered by $\Upsilon_{D_{\chi^2},a}$ for different values of $a$



rule                                                     $\Upsilon_{D_{\chi^2},a}$
class → odor ring-type                                   9.84024
class → odor spore-print-color                           9.16709
class → odor veil-color                                  8.22064
class → odor gill-attachment                             8.2026
class → gill-color spore-print-color                     7.82161
class → ring-type spore-print-color                      7.62564
class → odor stalk-root                                  7.60198
class → gill-color ring-type                             7.28972
class → odor stalk-color-above-ring                      7.19584
class → odor stalk-color-below-ring                      7.14197

rule                                                     $\Upsilon_{D_{\chi^2},0.9}$
odor → class stalk-root                                  3.61877
class stalk-root → odor                                  3.2782
odor → class cap-color                                   2.59777
odor → class ring-type                                   2.54896
odor → class spore-print-color                           2.54864
stalk-color-above-ring → class stalk-color-below-ring    2.47669
class cap-color → odor                                   2.46105
odor → class gill-color                                  2.45027
stalk-color-below-ring → class stalk-color-above-ring    2.38593
class spore-print-color → odor                           2.35384

rule                                                     $\Upsilon_{D_{\chi^2},1}$
class stalk-root → odor                                  4.11701
class stalk-color-below-ring → stalk-color-above-ring    3.38287
stalk-color-below-ring → class stalk-color-above-ring    3.37968
class ring-type → odor                                   2.98764
class cap-color → odor                                   2.85308
odor → class gill-color                                  2.82423
odor → class spore-print-color                           2.56331
odor → class stalk-color-below-ring                      2.44004
class stalk-color-above-ring → odor                      2.42725
class gill-color → spore-print-color                     2.42224



We note also that similar rules are ranked close to the top for all values
of $a$, which indicates that measures throughout the family identify dependencies
correctly. From data omitted in the tables it can be observed that, as in the case
of synthetic data, when $a$ approaches 1 the measures become more and more
symmetric and unaffected by independent attributes in the consequent.

It has been shown experimentally that measures throughout the $\Upsilon$ family
are useful for discovering interesting dependencies among data attributes. By
modifying a numerical parameter we can obtain a whole spectrum of measures of
varying degree of symmetry and dependence on the presence of extra attributes
in the rule consequent. Especially interesting seem to be measures with a parameter
close to, but less than, 1, which combine relative robustness against extra
independent attributes with the asymmetry suggesting the direction
of the dependence.

5 Open Problems and Future Directions

Above we assumed complete confidence in the estimate of the distribution of $P$
from the data. We may want to relax this restriction and assume that $P$ has
some a posteriori distribution $\Psi$ (not necessarily equal to $\Delta_P$), and $Q$ the prior
distribution $\Theta$. We can then generalize $\Upsilon$ as

$$\Upsilon'_{D,\Theta,\Psi}(P \to Q) = D(\Delta_{PQ}, \Psi \times \Theta) - D(\Delta_Q, \Theta) - D(\Delta_P, \Psi).$$

When $\Psi = \Delta_P$, $\Upsilon'$ reduces to $\Upsilon$ defined above.
Some of the properties of $\Upsilon$ are preserved by this new definition. For example,
if $D = D_{KL}$ and $P$, $Q$ are independent, then $\Upsilon'_{D,\Theta,\Psi}(P \to Q) = 0$. Also, if
$P \to Q$ is a rule in the table $\tau = (T, H, \rho)$ and $D = D_{KL}$, then
$\Upsilon'_{D,\Theta,\Psi}(P \to Q) = \mathrm{gain}_{shannon}(Q, P)$ regardless of the choice of $\Theta$ and $\Psi$.

Further theoretical and experimental evaluation of the new measure is nec- 
essary. It would be of practical interest to find a modified general definition of 
gain that, being asymmetric, is not affected by adding independent attributes 
to the consequent. 

As a primary application, we envision using the measure in association rule 
mining systems for sorting the discovered association rules. For this purpose it 
would be necessary to generalize the measure to express the interestingness of a 
rule with respect to a system of beliefs (that could be represented for example by 
a set of rules). Then, the rule would be considered interesting if its probability 
distribution would be significantly different from the one expected based on the 
set of beliefs. See for a discussion of a similar problem. 

Further work is necessary to assess the impact that the generalized measure
would have on other common data mining tasks like attribute selection in decision
trees. It might, for example, be beneficial to use values of the parameter $a$ close to
1 in the upper parts of the tree, where a large amount of data is still available, and
to decrease the value of $a$ at lower levels, where the amount of data is small and
thus we have less confidence in the estimates of probabilities.
Abstract. We extend a multi-class categorization scheme proposed by
Dietterich and Bakiri 1995 for binary classifiers, using error correcting 
codes. The extension comprises the computation of the codes by a sim- 
ulated annealing algorithm and optimization of Kullback-Leibler (KL) 
category distances within the code-words. For the first time, we apply 
the scheme to text categorization with support vector machines (SVMs) 
on several large text corpora with more than 100 categories. The results 
are compared to 1-of-N coding (i.e. one SVM for each text category). We 
also investigate codes with optimized KL distance between the text cate- 
gories which are merged in the code-words. We find that error correcting 
codes perform better than 1-of-N coding with increasing code length. For 
very long codes, the performance is in some cases further improved by 
KL-distance optimization. 



1 Introduction 

Automatic text categorization has become a vital topic in many text-mining
applications. Imagine, for example, the automatic classification of Internet pages
for a search engine database. There exist promising approaches to this task;
among them, support vector machines (SVMs) are one of the most successful
solutions. One remaining problem, however, is that an SVM can only separate two
classes at a time. Thus the traditional 1-of-n output coding scheme is applied
in this case: n classes need n classifiers, trained independently. Early
alternative solutions were published by Dietterich and Bakiri, and Vapnik
[p. 438]. More recently, research in multi-class SVMs has increased; see for example
Guermeur et al., Platt et al., Crammer et al. [2], and Allwein et al.

Hsu and Lin report that for many categories, multi-class solutions which
construct a system of binary SVMs have advantages in computational resources
over integrative extensions of SVM. It is therefore interesting to apply the
approach of Dietterich and Bakiri, who use a distributed output code, to text
categorization. A second argument is that if the output code has more bits than
needed to represent each class as a unique pattern, the additional bits may be
used to correct classification errors. In this paper we investigate the potential
of error correcting codes for text categorization with many categories. We
extend their work by a simulated annealing algorithm for code generation and
optimization of Kullback-Leibler (KL) category distances within the code-words.
For the first time, we apply the scheme to text categorization on several large
text corpora with 42 to 109 categories and up to 11 million running words. The
results are compared to 1-of-N coding (i.e. one SVM for each text category).

2 The Text Corpora 

Table 1 gives an overview of the quantitative properties of the four
text corpora we used. The columns have the following meaning: '# categories'
is the number of categories to distinguish. '# documents' is the total number
of texts in the corpus. '# types' is the total number of different words or,
more precisely, different alphanumeric strings including punctuation marks
found in a corpus. '# tokens' is the number of running words in a corpus.
'min length' is the minimal length of a document in running words (tokens);
shorter documents were discarded from the corpus. 'min # docs per cat.' is the
minimal number of documents of a specific category found in the corpus;
all categories with fewer documents were discarded. Each corpus was thus divided
into n disjoint sets by its n remaining categories.



Table 1. Quantitative properties of the text corpora 



corpus    # categories   # documents   # types   # tokens    min length   min # docs per cat.
bz        64             4366          103665    992483      200          10
reuters   42             10216         54832     887357      15           10
sjm       109            36431         254440    11163970    300          80
wsj       101            41838         203638    11989608    100          50



We did not apply any preprocessing steps such as stemming, elimination of
stop-words, etc. to the corpora. Our earlier research on text coding for SVMs
indicated that exhaustive inclusion of full word-forms, numbers and punctuation
improves recognition rates. We now describe the contents of the four corpora:

- bz: Texts from the German newspaper “Berliner Zeitung”. These texts were
drawn from the online archive of the newspaper^1. The categorization task
here was to recognize the author of the document. There are 64 different
authors in total. We chose this corpus because we already had promising
results for authorship attribution (see [3|). 

- reuters: We used the Reuters-21578 dataset^2 compiled by David Lewis and
originally collected by the Carnegie group from the Reuters newswire in
1987. The task was to recognize the correct topic out of 42 selected topics.
We already used a smaller subset of the reuters corpus in a pilot study.



^1 See http://www.BerlinOnline.de

^2 Obtainable at http://www.research.att.com/~lewis/reuters21578.html
- sjm: News articles from the 1991 “San Jose Mercury News” from the TIPSTER
database vol. 3. The TIPSTER catalogs are available from the Linguistic
Data Consortium (http://www.ldc.upenn.edu). Each news document contains
a list of manually assigned codes for categories. To exclude problems
of overlapping categories, we chose only the first item from the list as the label
of the document.

- wsj: Newspaper texts from the “Wall Street Journal”, 1990-92, from the
TIPSTER database vol. 2. As for the preceding corpus, we chose the first
item from the list of categories for each text.

Figure 1 shows the type-frequency spectra of the corpora. It displays
frequency on the x-axis and the number of words with a given frequency on the y-axis.
A small slope of the spectrum is an indicator of standardized language use in a
text corpus. With respect to this criterion, language use in
the bz corpus appears to be more creative than in the others.




Fig. 1. Type-frequency spectra of the text corpora
3 Methods

3.1 Error Correcting Codes 

Classification with error correcting codes can be seen “as a kind of communications
problem in which the identity of the correct output class for a new example
is being transmitted over a channel which consists of the input features, the training
examples, and the learning algorithm” [p. 266]. The classification of a new
set of input features can no longer be determined from the output of one classifier.
It is coded in a distributed representation of $l$ outputs from all classifiers.
Table 2 shows an example of an error correcting code for $n = 8$ classes with
$l = 5$ code-words. The code-words are the columns of the table. Each classifier
has to learn one of the code-words. This means that the classifier should output
a 1 for all input data belonging to one of the classes which are assigned 1 in the
code-word of the classifier, and 0 in all other cases. The code for a specific class
is found in the row of the table which is assigned to the class. The code
length in bits is the number of code-words, 5 in our example. Note that at least
$\lceil \log_2(n) \rceil$ code-words are required to distinguish $n$ classes.



Table 2. Error correcting code for 8 classes with 5 code-words

         code-word
class    1   2   3   4   5
1        0   0   0   1   1
2        0   0   1   0   1
3        0   1   0   0   1
4        0   1   1   0   0
5        1   0   0   1   0
6        1   1   0   1   1
7        1   1   1   0   1
8        1   1   1   1   0



Noise is introduced by the learning set, choice of features, and flaws of the
learning algorithm. Noise may induce classification errors. But if there are more
code-words than needed to distinguish $n$ classes, i.e. $l \gg \log_2(n)$, we can use the
additional bits to correct errors: if the output code does not match exactly one
of the classes, take the class with minimal Hamming distance. If the minimum
Hamming distance between class codes is $d$, we can in this way correct at least
$\lfloor (d - 1)/2 \rfloor$ single bit errors (see [p. 266]). It is therefore important to use codes with
maximized Hamming distance for the classes.
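
The decoding rule just described can be sketched in a few lines. The code matrix below is the example of Table 2 as transcribed here; the function names are illustrative, and ties between equally distant classes are broken by taking the lowest class number, which is an assumption not stated in the text.

```python
import numpy as np

# Rows: classes 1..8, columns: code-words 1..5 (example code of Table 2).
CODE = np.array([
    [0, 0, 0, 1, 1],
    [0, 0, 1, 0, 1],
    [0, 1, 0, 0, 1],
    [0, 1, 1, 0, 0],
    [1, 0, 0, 1, 0],
    [1, 1, 0, 1, 1],
    [1, 1, 1, 0, 1],
    [1, 1, 1, 1, 0],
])

def decode(bits, code=CODE):
    """Return (class number, Hamming distance) of the nearest class code."""
    distances = np.sum(code != np.asarray(bits), axis=1)
    best = int(np.argmin(distances))     # first minimum wins on ties
    return best + 1, int(distances[best])

def min_class_distance(code=CODE):
    """Minimum Hamming distance d between any two class codes (rows)."""
    n = len(code)
    return min(int(np.sum(code[i] != code[j]))
               for i in range(n) for j in range(i + 1, n))
```

For this small example the minimum distance between class codes is 2, so a single flipped bit is detected but cannot always be attributed to a unique class; longer codes with larger $d$ are needed for actual correction.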

There are several potential advantages in this approach: the number of required
classifiers increases with $O(\log_2(n))$ only, and additional bits can be used
for error correction.

We did not use the optimization criteria given by Dietterich and Bakiri to 
find optimal codes. Instead, we used simulated annealing to optimize a mix 
of Hamming distances of class codes, and of code-words. For a corpus with n 
Table 3. Minimal Hamming distances for the sjm-codes



                 min Hamming distance        KL-distance
# code words     categories   code-words     random   optimized
54               21           48             4.080    4.054
81               34           46             4.085    4.059
109              47           43             4.084    4.066
164              75           40             4.084    4.072



categories to distinguish, we generated error correcting codes of four different
sizes: $0.5 \cdot n$, $0.75 \cdot n$, $1.0 \cdot n$, and $1.5 \cdot n$ bits.

See Table 3 for the Hamming distances of a set of codes we used to categorize
the sjm corpus. For larger numbers of code-words, i.e. increasing $l$, the Hamming
distances between class codes grow. The reason is that the number of class codes to be generated
remains constant ($n = 109$) and therefore we have more degrees of freedom to
place the bits in the class codes. The Hamming distances of code-words decrease
with increasing $l$. Here, we have a constant length of code-words, but increasing
numbers of them. Therefore we have decreasing degrees of freedom to design the
code-words.



3.2 Optimized Kullback-Leibler Distance 

The mapping of corpus categories onto error correcting codes has been arbitrary
so far. Remember that each SVM in the classifier system has to implement
categorization according to one code-word, i.e. a specific column of a code matrix
like the one shown in Table 2. This means that it has to recognize documents
from each category labeled 1 in that column, and to reject all categories labeled
0.

Our hypothesis was that a reordering of classes such that more similar classes
were grouped together in each of the code-words should improve the performance.
We used the Kullback-Leibler test [p. 57] to compute the distance
$KLD(p_1, p_2)$ between two categories $p_1$ and $p_2$ as shown in equation (1):

$$KLD(p_1, p_2) = \sum_j p_1(j) \log\left(\frac{p_1(j)}{q(j)}\right) + \sum_j p_2(j) \log\left(\frac{p_2(j)}{q(j)}\right) \qquad (1)$$

q{^) = 

The two categories are represented by their word-frequency vectors $p_1$ and $p_2$
of length $n$. These vectors contain the frequencies of all types in the corpus, but
the counts are restricted to documents which belong to the category in question.
Therefore some values may be equal to 0.

The Kullback-Leibler distance of a whole matrix $M$ of error-correcting codes
is now computed by comparing the categories with label 1 against each other
and also the categories with label 0. Let $M_1$ be the set of text categories which
are labeled 1 in $M$ and $M_0$ the set of text categories which are labeled 0 in $M$.
Then we can define the Kullback-Leibler distance of $M$ as

$$KLD(M) = \frac{\sum_{p_i, p_j \in M_0} KLD(p_i, p_j)}{l_0 \cdot (l_0 - 1) \cdot 0.5} + \frac{\sum_{p_i, p_j \in M_1} KLD(p_i, p_j)}{l_1 \cdot (l_1 - 1) \cdot 0.5} \qquad (2)$$

$l_0$ is the number of elements in $M_0$, $l_1$ is the number of elements in $M_1$.
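
A minimal sketch of equations (1) and (2) for a single code-word follows. Several points are assumptions rather than statements from the text: the reference distribution $q$ is taken to be the mean of the two category distributions, the frequency vectors are normalized to sum to 1, the sums in equation (2) run over unordered pairs, and the two group averages are added. All function names are hypothetical.

```python
from itertools import combinations
import numpy as np

def kld(p1, p2):
    """Symmetric distance of equation (1), with q ASSUMED to be the mean."""
    p1 = np.asarray(p1, dtype=float)
    p2 = np.asarray(p2, dtype=float)
    p1 = p1 / p1.sum()                   # assumption: counts -> probabilities
    p2 = p2 / p2.sum()
    q = 0.5 * (p1 + p2)                  # assumption, not given in the text
    total = 0.0
    for p in (p1, p2):
        mask = p > 0                     # zero frequencies contribute nothing
        total += float(np.sum(p[mask] * np.log(p[mask] / q[mask])))
    return total

def mean_pairwise_kld(vectors):
    """Sum of KLD over unordered pairs divided by l * (l - 1) * 0.5."""
    pairs = list(combinations(vectors, 2))
    if not pairs:
        return 0.0
    return sum(kld(a, b) for a, b in pairs) / len(pairs)

def codeword_kld(column, category_vectors):
    """Equation (2) for one code-word (one column of the code matrix)."""
    zeros = [v for bit, v in zip(column, category_vectors) if bit == 0]
    ones = [v for bit, v in zip(column, category_vectors) if bit == 1]
    return mean_pairwise_kld(zeros) + mean_pairwise_kld(ones)
```

With the mean as reference distribution, every nonzero entry of $p_1$ or $p_2$ has a nonzero counterpart in $q$, which is one way of coping with the zero frequencies mentioned above.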

We tested optimization of $KLD(M)$ after generation of the code matrix $M$,
as well as simultaneous optimization of Hamming distance and Kullback-Leibler
distances. In this paper we only report results on the simultaneous optimization,
because the performance was better in general. See Table 3 for KLD values of
the sjm corpus.



3.3 Support Vector Machines 

Support Vector Machines (SVMs) recently gained popularity in the learning
community. In its simplest linear form, an SVM is a hyperplane that separates a
set of positive examples from a set of negative examples with maximum interclass
distance, the margin.

The SVM can be extended to nonlinear models by mapping the input space
into a very high-dimensional feature space chosen a priori. In this space the
optimal separating hyperplane is constructed [p. 421].

The distinctive advantage of the SVM for text categorization is its ability
to process many thousand different inputs. This opens the opportunity to use
all words in a text directly as features. For each word $w_i$ the number of times
of occurrence is recorded. Joachims and also we used the SVM for the
classification of text into different topic categories. Dumais et al. use linear
SVMs for text categorization because they are both accurate and fast: they are
35 times faster to train than the next most accurate of the tested classifiers
(a decision tree). They apply SVMs to the reuters collection, e-mails and web
pages.



3.4 Transformations of Frequency Vectors and Kernel Functions 

The mapping of text to the SVM input space consists of three parts. First the
type-frequencies (i.e. number of occurrences of words) are transformed by a
bijective mapping. The resulting vector is multiplied by a vector of importance
weights, and is finally normalized to unit length. We used those transformations
that yielded the best results in our earlier work.

We define the vector of logarithmic type frequencies of document $d_i$ by

$$\mathbf{l}_i = \big(\log(1 + f(w_1, d_i)), \ldots, \log(1 + f(w_n, d_i))\big), \qquad (3)$$

where $f(w_k, d_i)$ is the frequency of occurrence of term $w_k$ in document $d_i$, and $n$
is the number of different terms in all documents of the collection. Logarithmic
frequencies are combined with different importance weights. They are normalized
with respect to $L_2$.

Importance weights can be used to quantify how specific a given type is to 
the documents of a text collection. A type which is evenly distributed across 
the document collection should be given a low importance weight because it is 
judged to be less specific for the documents it occurs in. A type which is used in 
only a few documents should be given a high importance weight. Redundancy 
quantifies the skewness of a probability distribution, and is a measure of how 
much the distribution of a term Wk in the various documents deviates from the 
uniform distribution. 

We therefore consider the empirical distribution of a type over the documents
in the collection and define the importance weight of type $w_k$ by

$$r_k = \log N + \sum_{i=1}^{N} \frac{f(w_k, d_i)}{f(w_k)} \log \frac{f(w_k, d_i)}{f(w_k)}, \qquad (4)$$

where $N$ is the number of documents in the collection. This yields a vector of
importance weights for the whole document collection:

$$\mathbf{r} = (r_1, \ldots, r_n). \qquad (5)$$

The advantage of redundancy over inverse document frequency is that it
does not simply count the documents a type occurs in but takes into account
the frequencies of occurrence in each document. The difference between
redundancy and idf is larger for longer documents.

From the standpoint of the SVM learning algorithm the best normalization
rule is the $L_2$-normalization because it yields the best error bounds.
$L_2$-normalization has been used by Joachims and Dumais et al. So the complete
frequency transformation we used is defined as

$$\mathbf{x}_i = \frac{\mathbf{l}_i * \mathbf{r}}{\|\mathbf{l}_i * \mathbf{r}\|_{L_2}},$$

where $*$ denotes component-wise multiplication.

We used only the linear kernel function $K(\mathbf{x}, \mathbf{x}') = \mathbf{x} \cdot \mathbf{x}'$ of the SVM, because
increased computing times prevented tests with nonlinear kernels. Furthermore,
linear kernels performed very well in our pilot study.
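
The three-step transformation (logarithmic frequencies, redundancy weights, $L_2$ normalization) can be sketched as follows. The sketch assumes a dense document-by-type count matrix and takes $f(w_k)$ to be the total frequency of type $w_k$ over all documents, which is what makes the ratios in equation (4) an empirical distribution; the function names are illustrative.

```python
import numpy as np

def redundancy_weights(counts):
    """Importance weights r_k of equation (4) for an (N docs x n types) matrix."""
    counts = np.asarray(counts, dtype=float)
    n_docs = counts.shape[0]
    totals = counts.sum(axis=0)                      # f(w_k), assumed total count
    probs = counts / np.where(totals > 0, totals, 1.0)
    plogp = np.zeros_like(probs)
    mask = probs > 0                                 # 0 * log 0 is taken as 0
    plogp[mask] = probs[mask] * np.log(probs[mask])
    return np.log(n_docs) + plogp.sum(axis=0)

def transform(counts):
    """Log frequencies (3), weighted by redundancy (4), normalized to unit L2 length."""
    counts = np.asarray(counts, dtype=float)
    weighted = np.log1p(counts) * redundancy_weights(counts)
    norms = np.linalg.norm(weighted, axis=1, keepdims=True)
    return weighted / np.where(norms > 0, norms, 1.0)  # all-zero rows stay zero
```

A type spread evenly over all documents receives weight 0, while a type concentrated in a single document receives the maximal weight $\log N$, which matches the intended reading of the importance weights.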

4 Performance Measures 

We applied performance measures which are widely used in information retrieval, 
because we think they are more adequate to text categorization than plain error 
rates. 
4.1 Definitions



Precision (equation 6) is the percentage of documents in the target category $i$
among all those documents which are categorized (perhaps wrongly) as category $i$:

$$prc(i) = 100 \cdot \frac{c_t(i)}{c_t(i) + e_o(i)}, \qquad (6)$$

with $prc(i) = 0$ if $c_t(i) = 0$.

Recall (equation 7) is the percentage of documents in target category $i$ which
are recognized correctly:

$$rec(i) = 100 \cdot \frac{c_t(i)}{c_t(i) + e_t(i)}. \qquad (7)$$

$c_t(i)$: correctly categorized documents of target class
$e_t(i)$: wrongly categorized documents of target class
$c_o(i)$: correctly categorized documents of other classes
$e_o(i)$: wrongly categorized documents of other classes


Since precision and recall may be inadequate to measure the performance on
very small categories, we also computed a performance measure combined from
precision and recall. It is called the F-measure (see [12, p. 269]):

$$F(i) = \frac{1}{\alpha \frac{1}{prc(i)} + (1 - \alpha) \frac{1}{rec(i)}}. \qquad (8)$$

$\alpha$ is a weighting factor. In this paper we set $\alpha = 0.5$, because we wanted to put
equal importance on both precision and recall. Thus we get $F(i) = \frac{2 \cdot prc(i) \cdot rec(i)}{prc(i) + rec(i)}$.
The mean performance over all $N$ categories of a corpus is

$$prc = \sum_{i=1}^{N} p_t(i) \cdot prc(i) \qquad (9)$$

$$rec = \sum_{i=1}^{N} p_t(i) \cdot rec(i) \qquad (10)$$

$$F = \sum_{i=1}^{N} p_t(i) \cdot F(i) \qquad (11)$$

$p_t(i)$ is the probability of occurrence of category $i$ in the corpus. We determined
$p_t(i)$ as the fraction of documents of category $i$ in all documents of the corpus.
Weighted in this way, $prc$, $rec$, and $F$ again can vary between 0% and 100%.
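
The definitions above translate directly into code. In the sketch below the function names are illustrative, and returning 0 for recall when $c_t(i) = 0$ and for the F-measure when either component is 0 are assumptions made to avoid division by zero; the text itself only fixes $prc(i) = 0$ for $c_t(i) = 0$.

```python
def precision(ct, eo):
    """prc(i) = 100 * ct / (ct + eo); defined as 0 when ct == 0."""
    return 0.0 if ct == 0 else 100.0 * ct / (ct + eo)

def recall(ct, et):
    """rec(i) = 100 * ct / (ct + et); taken as 0 when ct == 0 (assumption)."""
    return 0.0 if ct == 0 else 100.0 * ct / (ct + et)

def f_measure(prc, rec, alpha=0.5):
    """F(i) = 1 / (alpha / prc + (1 - alpha) / rec); 0 if a component is 0."""
    if prc == 0 or rec == 0:
        return 0.0
    return 1.0 / (alpha / prc + (1.0 - alpha) / rec)

def weighted_mean(values, doc_counts):
    """Mean over categories, weighted by p_t(i) = share of documents in category i."""
    total = sum(doc_counts)
    return sum(v * c / total for v, c in zip(values, doc_counts))
```

For example, a category with 8 correctly accepted documents, 2 wrongly accepted documents and 8 missed documents has precision 80 and recall 50.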



274 J. Kindermann, G. Paass, and E. Leopold 



4.2 Results 

To exploit the training set $S_0$ of each corpus in a better way, we used the following
cross-testing procedure. $S_0$ was randomly divided into 3 subsets $S_1, S_2, S_3$
of nearly equal size. Then three different SVM classification systems were determined,
each using $S_0 \setminus S_i$ as training set and $S_i$ as test set. The numbers
of correctly and wrongly classified documents were added up, yielding an effective
test set of all documents in $S_0$. The SVM implementation of Joachims
(http://ais.gmd.de/~thorsten/svm_light/) was used for our experiments.
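
The cross-testing split can be sketched as follows. This only illustrates the partitioning of document indices; the helper name, the fixed seed and the round-robin assignment after shuffling are assumptions, and the training of the SVMs themselves is not shown.

```python
import random

def cross_test_splits(n_docs, k=3, seed=0):
    """Yield (train, test) index lists: S_0 minus S_i for training, S_i for testing."""
    indices = list(range(n_docs))
    random.Random(seed).shuffle(indices)           # random division of S_0
    folds = [indices[i::k] for i in range(k)]      # k subsets of nearly equal size
    for i in range(k):
        test = folds[i]
        train = [d for j, fold in enumerate(folds) if j != i for d in fold]
        yield train, test
```

Summing the per-fold counts of correctly and wrongly classified documents over the three test sets then covers every document of the corpus exactly once.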

Table 4 shows the mean performance with respect to precision (equation 9),
recall (equation 10), and F-measure (equation 11). The first four rows give the
values for the error correcting codes of different bit length (Sect. 3.1). The last
row gives the value of the corresponding 1-of-N codes. The best values are printed
in bold font. Table 5 shows the corresponding values for error correcting codes
with optimized KL-distance.

In both tables the 1-of-N coding has better precision for corpora bz and
reuters and better recall for corpora sjm and wsj. The combined F-measure,
however, is better for error correcting codes on all four corpora.

Comparing tables with and without KL-optimization, there is no clear tendency.
It seems that for corpora bz and sjm there generally is a slight improvement.
A slight degradation can be seen in most cases of reuters and wsj.

Table 0 shows the percentage of categories on which error correcting codes
perform better than the 1-of-N code with respect to precision (equation 9), recall
(equation 10), and F-measure (equation 11). Table 0 shows the corresponding
percentages for error correcting codes with optimized KL-distance.

We observe converse trends for precision and recall: those corpora with many
good results for error correcting codes on precision tend to perform poorly on recall
and vice versa.

Comparing tables with and without KL-optimization, there is a mixed ten- 
dency for precision and recall. Regarding the F-measure, there is improvement 
for most KL-optimized codes. 

5 Conclusion 

We investigated the potential of error correcting codes for text categorization 
on several large text collections. The error correcting code classifier was imple- 
mented as a combination of binary SVMs. We compared the results with those 
of a 1-of-N classifier. The main result is that long error correcting codes perform 
better than 1-of-N codes for a combination of precision and recall error measures. 

We also investigated the effects of optimizing the Kullback-Leibler distance 
of the text categories grouped together in the codewords. The classification 
performance could be improved slightly for the corpora bz and sjm, but this 
result seems to be of mostly theoretical interest. 

Acknowledgments. We would like to thank Thorsten Joachims and Tamas 
Horvath for inspiring discussions on this topic. 
Table 4. Mean performance on all corpora - no KL-optimization 

precision 
code length   bz     reuters   sjm    wsj 
50%           80.4   91.5      58.3   59.0 
75%           79.8   91.4      57.7   58.6 
100%          79.8   91.4      58.6   58.2 
150%          79.6   91.8      59.8   58.7 
1-of-n        89.0   93.2      53.2   47.6 

recall 
code length   bz     reuters   sjm    wsj 
50%           78.2   91.8      58.5   57.7 
75%           79.2   91.5      59.0   58.5 
100%          79.7   91.5      59.8   58.3 
150%          80.2   91.7      59.6   58.8 
1-of-n        60.7   89.1      63.9   69.0 

F-measure 
code length   bz     reuters   sjm    wsj 
50%           76.9   91.4      55.5   55.7 
75%           76.8   91.0      55.5   56.5 
100%          77.2   91.0      56.5   56.5 
150%          78.0   91.3      56.1   57.0 
1-of-n        69.5   90.4      55.9   54.9 



Table 5. Mean performance on all corpora - with KL-optimization 

precision 
code length   bz     reuters   sjm    wsj 
50%           80.6   91.2      58.2   58.3 
75%           80.3   91.2      58.4   58.8 
100%          79.6   91.3      58.3   58.3 
150%          80.9   91.5      58.9   58.7 
1-of-n        89.0   93.2      53.2   47.6 

recall 
code length   bz     reuters   sjm    wsj 
50%           79.0   91.4      58.7   57.5 
75%           80.0   91.3      59.3   58.6 
100%          80.4   91.5      59.7   58.6 
150%          80.7   91.6      59.9   58.6 
1-of-n        60.7   89.1      63.9   69.0 

F-measure 
code length   bz     reuters   sjm    wsj 
50%           77.1   91.0      55.8   55.7 
75%           78.0   90.9      55.8   56.8 
100%          78.2   91.1      56.4   56.6 
150%          78.5   91.3      56.7   56.8 
1-of-n        69.5   90.4      55.9   54.9 
Table 6. Percentage of categories on which EC-codes perform better than 
1-of-N codes - no KL-optimization 

precision 
code length   bz     reuters   sjm    wsj 
50%           15.6   23.8      74.3   47.7 
75%           20.3   14.3      83.2   60.6 
100%          18.8   11.9      82.2   56.9 
150%          17.2   23.8      89.1   63.3 

recall 
code length   bz     reuters   sjm    wsj 
50%           84.4   66.7      17.8   44.0 
75%           85.9   76.2      21.8   43.1 
100%          90.6   83.3      16.8   48.6 
150%          89.1   85.7      19.8   37.6 

F-measure 
code length   bz     reuters   sjm    wsj 
50%           73.4   45.2      37.6   41.3 
75%           75.0   52.4      43.6   45.9 
100%          65.6   59.5      48.5   56.0 
150%          75.0   69.0      60.4   46.8 


Table 7. Percentage of categories on which EC-codes perform better than 
1-of-N codes - with KL-optimization 

precision 
code length   bz     reuters   sjm    wsj 
50%           20.3   14.3      73.3   56.0 
75%           17.2   16.7      84.2   63.3 
100%          14.1   19.0      83.2   58.7 
150%          20.3   16.7      91.1   65.1 

recall 
code length   bz     reuters   sjm    wsj 
50%           89.1   59.5      12.9   40.4 
75%           90.6   78.6      15.8   40.4 
100%          90.6   81.0      16.8   46.8 
150%          95.3   92.9      12.9   51.4 

F-measure 
code length   bz     reuters   sjm    wsj 
50%           79.7   31.0      35.6   45.0 
75%           78.1   52.4      46.5   46.8 
100%          75.0   61.9      47.5   57.8 
150%          79.7   76.2      53.5   61.5 
Abstract. The fact that data is scattered over many tables causes many 
problems in the practice of data mining. To deal with this problem, one either 
constructs a single table by hand, or one uses a Multi-Relational Data Mining 
algorithm. In this paper, we propose a different approach in which the single 
table is constructed automatically using aggregate functions, which repeatedly 
summarise information from different tables over associations in the datamodel. 
Following the construction of the single table, we apply traditional data mining 
algorithms. Next to an in-depth discussion of our approach, the paper presents 
results of experiments on three well-known data sets. 



1 Introduction 

An important practical problem in data mining is that we often want to find models 
and patterns over data that resides in multiple tables. This is solved by either con- 
structing a single table by hand (deriving attributes from the other tables) or by using 
a Multi-Relational Data Mining or ILP approach. In this paper we propose another 
approach, viz., automatic construction of the single mining table using aggregates. 

The motivation for the use of aggregates stems from the observation that the diffi- 
cult case in constructing a single table is when there are one-to-many relationships 
between tables. The traditional way to summarise such relationships in Statistics and 
OLAP is through aggregates that are based on histograms, such as count, sum, min, 
max, and avg. We limit ourselves to these aggregates, but note that they can be ap- 
plied recursively over a collection of relationships. 

The idea of propositionalisation (the construction of one table) is not new. Several 
relatively successful algorithms have been proposed in the context of Inductive Logic 
Programming (ILP) [6, 12, 7, 1, 2]. A common aspect of these algorithms is that the 
derived table consists solely of binary features, each corresponding to a (promising) 
clause discovered by an ILP-algorithm. Especially for numerical attributes, our ap- 
proach leads to a markedly different search space. 

We illustrate our approach on three well-known data sets. The aim of these ex- 
periments is twofold. Firstly, to demonstrate the accuracy in a range of domains. Sec- 
ondly, to illustrate the radically different way our approach models structured data, 
compared to ILP or MRDM approaches. 

L. De Raedt and A. Siebes (Eds.): PKDD 2001, LNAI 2168, pp. 277-288, 2001. 

© Springer-Verlag Berlin Heidelberg 2001 
The paper is organised as follows. First we discuss propositionalisation and aggregates 
in more detail. In particular we introduce the notion of depth, to illustrate the 
complexity of the search space. Next we introduce the RollUp algorithm that con- 
structs the single table. Then we present the results of our experiments and the paper 
ends with a discussion and conclusions. 

2 Propositionalisation 

In this section we describe the basic concepts involved in propositionalisation, and 
provide some definitions. In this paper, we define propositionalisation as the process 
of transforming a multi-relational dataset, containing structured examples, into a pro- 
positional dataset with derived attribute-value features, describing the structural prop- 
erties of the examples. The process can thus be thought of as summarising data stored 
in multiple tables in a single table (the target table) containing one record per exam- 
ple. The aim of this process, of course, is to pre-process multi-relational data for sub- 
sequent analysis by attribute-value learners. 

We will be using this definition in the broadest sense. We will make no assump- 
tions about the datatype of the derived attribute (binary, nominal, numeric, etc.) nor 
do we specify what language will be used to specify the propositional features. Tradi- 
tionally, propositionalisation has been approached from an ILP standpoint with only 
binary features, expressed in first-order logic (FOL) [6, 7, 1, 2]. To our knowledge, the 
use of aggregates other than existence has been limited. One example is given in [4], 
which describes a propositionalisation-step where numeric attributes were defined for 
counts of different substructures. [5] also mentions aggregates as a means of 
establishing probabilistic relationships between objects in two tables. It is our aim to 
analyse the applicability of a broader range of aggregates. 

With a growing availability of algorithms from the fields of ILP and Multi- 
Relational Data Mining (MRDM), one might wonder why such a cumbersome pre- 
processing step is desirable in the first place, instead of applying one of these algo- 
rithms to the multi-relational data directly. The following is a (possibly incomplete) 
list of reasons: 

• Pragmatic choice for specific propositional techniques. People may wish to apply 
their favourite attribute-value learner, or only have access to commercial 
off-the-shelf Data Mining tools. Good examples can be found in the contributions to the 
financial dataset challenge at PKDD conferences [14]. 

• Superiority of propositional algorithms with respect to certain Machine Learning 
parameters. Although extra facilities are quickly being added to existing ILP en- 
gines, propositional algorithms still have a head-start where it concerns handling 
of numeric values, regression, distance measures, cumulativity etc. 

• Greater speed of propositional algorithms. This advantage of course only holds if 
the preceding work for propositionalisation was limited, or performed only once 
and then reused during multiple attribute-value learning sessions. 

• Advantages related to multiple consecutive learning steps. Because we are apply- 
ing two learning steps, we are effectively combining two search strategies. The 
first step essentially transforms a multi-relational search space into a proposi- 
tional one. The second step then uses these complex patterns to search deeper 
than either step could achieve when applied in isolation. This issue is investigated 
in more detail in the remainder of this section. 

The term propositionalisation leads to some confusion because, although it per- 
tains to the initial step of flattening a multi-relational database, it is often used to 
indicate the whole approach, including the subsequent propositional learning step. 
Because we are mostly interested in the two steps in unison, and for the sake of dis- 
cussion, we introduce the following generic algorithm. The name is taken from 
Czech, and indicates a two-step dance. 

Polka (DB D; DM M; int r, p) 

P := MRDM (D, M, r); 

R := PDM (P, p); 

The algorithm takes a database D and datamodel M (acting as declarative bias), 
and first applies a Multi-Relational Data Mining algorithm MRDM. The resulting 
propositional features P are then fed to a propositional Data Mining algorithm PDM, 
producing result R. We use the integers r and p very informally to identify the extent 
of the multi-relational and propositional search, respectively. Note that the proposi- 
tionalisation step is independent of subsequent use in propositional learning. 
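
The two-step structure of Polka can be sketched as a plain function composition. The snippet below is only an illustration under stated assumptions: `count_children` and `threshold_rule` are hypothetical stand-ins for an MRDM step and a propositional learner, the tiny database is invented, and the parameters r and p are used as informally as in the text.

```python
def polka(database, datamodel, mrdm, pdm, r, p):
    """Two-step scheme: propositionalise first, then learn propositionally."""
    features = mrdm(database, datamodel, r)   # P := MRDM(D, M, r)
    return pdm(features, p)                   # R := PDM(P, p)

def count_children(database, datamodel, r):
    """Stand-in MRDM step: one row per parent with a count of related child rows.
    The extent parameter r is accepted but unused in this toy version."""
    parent_table, child_table, key = datamodel
    return [
        {"id": row["id"],
         "n_children": sum(c[key] == row["id"] for c in database[child_table])}
        for row in database[parent_table]
    ]

def threshold_rule(table, p):
    """Stand-in PDM step: label each row by thresholding the derived feature."""
    return {row["id"]: row["n_children"] >= p for row in table}

# Invented two-table database: parent 1 has two children, parent 2 has one.
db = {
    "parent": [{"id": 1}, {"id": 2}],
    "child": [{"parent_id": 1}, {"parent_id": 1}, {"parent_id": 2}],
}
result = polka(db, ("parent", "child", "parent_id"),
               count_children, threshold_rule, r=1, p=2)
```

Because the first step only produces a table, any propositional learner can be substituted for `threshold_rule` without changing the propositionalisation step.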

In order to characterise more formally the extent of the search, we introduce three 
measures that are functions of the patterns that are considered. The values of the 
measures for the most complex patterns in the search space are then measures for the 
extent of the search algorithm. We can thus characterise both individual patterns, as 
well as algorithms. The definition is based on the graphical pattern language of Selec- 
tion Graphs, introduced in [8], but can be re-written in terms of other languages such 
as FOL, relational algebra or SQL. We first repeat our basic definition of Selection 
Graphs. 

Definition. A selection graph G is a pair (N, E), where N is a set of pairs (t, C), t is a 
table in the data model and C is a, possibly empty, set of conditions on attributes in t 
of type t.a operator c; the operator is one of the usual selection operators, =, >, etc. E 
is a set of triples (p, q, a) called selection edges, where p and q are selection nodes 
and a is an association between p.t and q.t in the data model. The selection graph 
contains at least one node n_0 (the root node) that corresponds to the target table t_0. 

Now assume G is a Selection Graph. 

Definition, variable-depth: d_v(G) equals the length of the longest path in G. 

Definition, clause-depth: d_c(G) equals the sum of the number of non-root 
nodes, edges and conditions in G. 

Definition, variable-width: w_v(G) equals the largest sum of the number of 
conditions and children per node, not including the root-node. 

The intuition of these definitions is as follows. An algorithm searches variable-deep 
if pieces of discovered substructure are refined by adding more substructure, 
resulting in chains of variables (edges in Selection Graphs). With each new variable, 
information from a new table is involved. An algorithm searches clause-deep, if it 
considers very specific patterns, regardless of the number of tables involved. Even 
propositional algorithms may produce clause-deep patterns that contain many condi- 
tions at the root-node and no other nodes. Rather than long chains of variables, vari- 
able-wide algorithms are concerned with the frequent reuse of a single variable. If 
information from a new table is included, it will be further refined by extra restric- 
tions, either through conditions on this information, or through further substructure. 
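
The three measures can be computed mechanically. The sketch below is an illustration, not part of the original work: it assumes a tree-shaped selection graph stored as a dictionary of condition lists per node plus a list of (parent, child) edges, and it reads "longest path" as the longest path starting at the root. The node names and conditions are invented.

```python
def children(edges, node):
    """Nodes reached from `node` by one selection edge."""
    return [q for p, q in edges if p == node]

def variable_depth(edges, root):
    """Length, in edges, of the longest path starting at the root."""
    kids = children(edges, root)
    return 0 if not kids else 1 + max(variable_depth(edges, k) for k in kids)

def clause_depth(conditions, edges):
    """Number of non-root nodes + edges + conditions (root conditions included)."""
    non_root_nodes = len(conditions) - 1
    return non_root_nodes + len(edges) + sum(len(c) for c in conditions.values())

def variable_width(conditions, edges, root):
    """Largest (conditions + children) over the non-root nodes, 0 if there are none."""
    return max((len(conditions[n]) + len(children(edges, n))
                for n in conditions if n != root), default=0)

# A root with one condition and two unconditioned neighbours.
conditions = {"parent": ["age > 40"], "child": [], "toy": []}
edges = [("parent", "child"), ("parent", "toy")]
measures = (variable_depth(edges, "parent"),
            clause_depth(conditions, edges),
            variable_width(conditions, edges, "parent"))

# A chain parent -> child -> toy with one condition on the middle node.
chain_conditions = {"parent": [], "child": ["age < 5"], "toy": []}
chain_edges = [("parent", "child"), ("child", "toy")]
chain_measures = (variable_depth(chain_edges, "parent"),
                  clause_depth(chain_conditions, chain_edges),
                  variable_width(chain_conditions, chain_edges, "parent"))
```

The two graphs have the same clause-depth but differ in variable-depth and variable-width, which is the distinction the three measures are meant to capture.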

Example. The following Selection Graph, which refers to a 3-table database 
introduced in [8], identifies parents above 40 who have a child and bought a toy. 
The measures produce the following complexity characteristics: 

[Figure: Selection Graph (parent, child, toy)] 

d_v(G) = 1, d_c(G) = 5, w_v(G) = 0 

The complexity measures can now be used to relate the search depth of Polka to 
the propositional and multi-relational algorithms it is made up of. 

Lemma 1. d_v(Polka) = d_v(MRDM) 

Lemma 2. d_c(Polka) = d_c(MRDM) · d_c(PDM) 

Lemma 3. w_v(Polka) = w_v(MRDM) 

Not surprisingly, the complexity of Polka depends largely on the complexity of 
the actual propositionalisation step. However, lemma 2 demonstrates that Polka con- 
siders very clause-deep patterns, in fact deeper than a multi-relational algorithm 
would consider in isolation. This is due to the combining of search spaces mentioned 
earlier. Later on we will examine the search restrictions that the use of aggregates 
imposes on the propositionalisation step and thus on Polka. 

3 Aggregates 

In the previous section we observed that an essential element of propositionalisation 
is the ability to summarise information distributed over several tables in the target 
table. We require functions that can reduce pieces of substructure to a single value, 
which describes some aspects of this substructure. Such functions are called aggre- 
gates. Having a set of well-chosen aggregates will allow us to describe the essence of 
the structural information over a wide variety of structures. 

We define an aggregate as a function that takes as input a set of records in a data- 
base, related through the associations in the data model, and produces a single value 
as output. We will be using aggregates to project information stored in several tables 
on one of these tables, essentially adding virtual attributes to this table. In the case 
where the information is projected on the target table, and structural information 
belonging to an example is summarised as a new feature of that example, aggregates can 
be thought of as a form of feature construction. 

Our broad definition includes aggregates of a great variety of complexity. An im- 
portant aspect of the complexity of an aggregate is the number of (associations be- 
tween) tables it involves. As each aggregate essentially considers a subset of the data 
model, we can use our 3 previously defined complexity-measures for data models to 
characterise aggregates. Specifically variable-depth is useful to classify aggregates. 
An aggregate of variable-depth 0 involves just one table, and is hence a case of pro- 
positional feature construction. In their basic usage, aggregates found in SQL (count, 
min, sum, etc.) have a variable-depth of 1, whereas variable-deeper aggregates repre- 
sent some form of Multi-Relational pattern (benzene-rings in molecules, etc.). Using 
this classification of variable-depth we give some examples to illustrate the range of 
possibilities. 

d_v(A) = 0: 

• Propositions (adult == (age ≥ 18)) 

• Arithmetic functions (area == width · length) 

d_v(A) = 1: 

• Count, count with condition 

• Count distinct 

• Min, max, sum, avg 

• Exists, exists with condition 

• Select record (eldest son, first contract) 

• Predominant value 

d_v(A) > 1: 

• Exists substructure 

• Count substructure 

• Conjunction of aggregates (maximum count of children) 

Clearly the list of possible classes of aggregates is long, and the number of instances 
is infinite. In order to arrive at a practical and manageable solution for propositionali- 
sation we will have to drastically limit the range of classes and instances. Apart from 
deterministic and heuristic rules to select good candidates, pragmatic limitations to a 
small set of aggregate classes are unavoidable. In this paper we have chosen to restrict 
ourselves to the classes available in SQL, and combinations thereof. The remainder of 
this paper further investigates the choice of instances. 

4 Summarisation 

We will be viewing the propositionalisation process as a series of steps in which in- 
formation in one table is projected onto records in another table successively. Each 
association in the data model gives rise to one such step. The specifics of such a step, 
which we will refer to as summarisation, are the subject of this section. 
Let us consider two tables P and Q, neither of which needs to be the target table, 
that are joined by an association A. By summarising over A, information can be added 
to P about the structural properties of A, as well as the data within Q. To summarise 
Q, a set of aggregates of variable-depth 1 are needed. 

As was demonstrated before in [8], the multiplicity of association A influences the 
search space of multi-relational patterns involving A. The same is true for summarisa- 
tion over A using aggregates. Our choice of aggregates depends on the multiplicity of 
A. In particular if we summarise Q over A only the multiplicity on the side of Q is 
relevant. This is because an association in general describes two relationships be- 
tween the records in both tables, one for each direction. The following four options 
exist: 

1     For every record in P there is but a single record in Q. This is basically a 
      look-up over a foreign key relation and no aggregates are required. A simple 
      join will add all non-key attributes of Q to P. 

0..1  Similar to the 1 case, but now a look-up may fail because a record in P may 
      not have a corresponding record in Q. An outer join is necessary, which fills 
      in NULL values for missing records. 

1..n  For every record in P, there is at least one record in Q. Aggregates are 
      required in order to capture the information in the group of records belonging 
      to a single record in P. 

0..n  Similar to the 1..n case, but now the value of certain aggregates may be 
      undefined due to empty groups. Special care will need to be taken to deal with 
      the resulting NULL values. 

Let us now consider the 1..n case in more detail. A imposes a grouping on the records 
in Q. For m records in P there will be m groups of records in Q. Because of the 
set-semantics of relational databases every group can be described by a collection of 
histograms or data-cubes. We can now view an aggregate instance as a function of 
one of these types of histograms. For example the predominant aggregate for an at- 
tribute Q.a simply returns the value corresponding to the highest count in the histo- 
gram of Q.a. Note that m groups will produce m histograms and thus m values for one 
aggregate instance, one for each record in P. The notion of functions of histograms 
helps us to define relevant aggregate classes. 
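
The view of aggregates as functions of histograms can be sketched in a few lines. The code below is a minimal illustration and not the system used in the paper: the record layout, the attribute names, and the helper functions are assumptions, and each group is represented by a 1-dimensional histogram held in a `collections.Counter`.

```python
from collections import Counter, defaultdict

def histograms(records, key, attribute):
    """One histogram (value -> count) of `attribute` per group of records sharing `key`."""
    groups = defaultdict(Counter)
    for rec in records:
        groups[rec[key]][rec[attribute]] += 1
    return groups

def predominant(hist):
    """Value with the highest count in the histogram."""
    return hist.most_common(1)[0][0]

def minimum(hist):
    """Smallest value with a non-zero count."""
    return min(v for v, n in hist.items() if n > 0)

def maximum(hist):
    """Largest value with a non-zero count."""
    return max(v for v, n in hist.items() if n > 0)

# Invented table Q: children, grouped by the parent they belong to.
q = [
    {"parent": 1, "gender": "m", "age": 4},
    {"parent": 1, "gender": "m", "age": 9},
    {"parent": 1, "gender": "f", "age": 7},
    {"parent": 2, "gender": "f", "age": 12},
]

gender_hist = histograms(q, "parent", "gender")
age_hist = histograms(q, "parent", "age")

# Two groups give two histograms and hence two values per aggregate instance.
predominant_gender = {p: predominant(h) for p, h in gender_hist.items()}
min_age = {p: minimum(h) for p, h in age_hist.items()}
max_age = {p: maximum(h) for p, h in age_hist.items()}
```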

count. The count aggregate is the most obvious aggregate through its direct relation 
to histograms. The most basic instance without conditions simply returns the single 
value in the 0-dimensional histogram. Adding a single condition requires a 1- 
dimensional histogram of the attribute involved in the condition. For example the 
number of sons in a family can be computed from a histogram of gender of that family. 
An attribute with a cardinality c will produce c aggregate instances of count with 
one condition. It is clear that the number of instances will explode if we allow even 
more conditions. As our final propositional dataset will then become impractically 
large we will have to restrict the number of instances. We will only consider counts 
with no condition and counts with one condition on nominal attributes. This implies 
that for the count aggregate w_v ≤ 1. 
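
The restriction to counts with no condition and counts with one condition on a nominal attribute can be made concrete as follows. This is a sketch under assumptions: the function name, the feature naming scheme, and the example records are invented, and an empty group is given a count of 0.

```python
from collections import Counter

def count_features(records, key, nominal_attribute, parent_ids):
    """Per parent record: the unconditioned count plus one count per nominal value."""
    values = sorted({rec[nominal_attribute] for rec in records})
    totals = Counter(rec[key] for rec in records)
    conditioned = Counter((rec[key], rec[nominal_attribute]) for rec in records)
    table = {}
    for pid in parent_ids:
        row = {"count": totals[pid]}   # a Counter returns 0 for an empty group
        for v in values:
            row[f"count_{nominal_attribute}={v}"] = conditioned[(pid, v)]
        table[pid] = row
    return table

# Invented child records; family 3 has no children (the 0..n case).
child_records = [
    {"family": 1, "gender": "m"},
    {"family": 1, "gender": "m"},
    {"family": 1, "gender": "f"},
    {"family": 2, "gender": "f"},
]
features = count_features(child_records, "family", "gender", parent_ids=[1, 2, 3])
```

An attribute with cardinality c contributes c conditioned counts, so here each family receives 1 + 2 = 3 count features.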

There is some overlap in the patterns that can be expressed by using the count ag- 
gregate and those expressed in FOL. Testing for a count greater than zero obviously 
corresponds to existence. Testing for a count greater than some threshold t, however, 
requires a clause-depth of O(t^2) in FOL. With the less-than operator things become 
even worse for FOL representations as it requires the use of negation in a way that the 
language bias of many ILP algorithms does not cater for. The use of the count aggre- 
gate is clearly more powerful in these respects. 
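
The quadratic growth can be seen by writing out one straightforward FOL encoding of "count greater than t": t + 1 witnesses are needed, and every pair of witnesses must be stated to be distinct. The generator below is illustrative only; the predicate name `child`, the variable names, and the `!=` notation are assumptions, and the number of literals is used as a rough proxy for clause-depth.

```python
from itertools import combinations

def count_greater_than(t, relation="child", parent="X"):
    """Body literals of a clause stating that `parent` has more than t related records."""
    variables = [f"Y{i}" for i in range(1, t + 2)]            # t + 1 witnesses
    literals = [f"{relation}({parent}, {y})" for y in variables]
    # Every pair of witnesses must be distinct: (t + 1) * t / 2 inequalities.
    literals += [f"{a} != {b}" for a, b in combinations(variables, 2)]
    return literals

clause_for_2 = count_greater_than(2)
sizes = {t: len(count_greater_than(t)) for t in (1, 2, 10)}
```

With an aggregate, the same test is a single comparison on the count feature, regardless of t.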

min and max. The two obvious aggregates for numeric attributes, min and max, ex- 
hibit similar behaviour. Again there is a trivial way of computing min and max from 
the histogram; the smallest and largest value for which there is a non-zero count, 
respectively. The min and max aggregates support another type of constraint com- 
monly used in FOL-based algorithms, existence with a numeric constraint. The fol- 
lowing proposition describes the correspondence between the minimum and maxi- 
mum of a group of numbers, and the occurrence of particular values in the group. 

Proposition. Let B be a bag of real numbers, and t some real, then 

max(B) > t iff ∃ v ∈ B : v > t, 

min(B) < t iff ∃ v ∈ B : v < t. 

This simply states that testing whether the maximum is greater than some thresh- 
old is equivalent to testing whether any value is greater than t. Analogous for min. It 
is important to note the combination of max and >, and min and < respectively. If max 
were to be used in combination with < or = then the FOL equivalent would again 
require the use of negation. Such use of the min and max aggregates gives us a natural 
means of introducing the universal quantifier ∀: all values are required to be above the 
minimum, or below the maximum. Another advantage of the min and max aggregates 
is that they each replace a set of binary existence aggregate instances (one for each 
threshold), making the propositional representation a lot more compact. 
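
The correspondence stated in the proposition, together with its universally quantified counterpart, can be checked numerically. The snippet below is a sanity check on randomly generated non-empty bags, not a proof; the sampling ranges and the function name are arbitrary choices.

```python
import random

def check(bag, t):
    """Verify the existential and universal readings of min and max for one bag."""
    # Existential reading (the proposition).
    assert (max(bag) > t) == any(v > t for v in bag)
    assert (min(bag) < t) == any(v < t for v in bag)
    # Universal reading: a bound on min or max constrains every value.
    assert (min(bag) > t) == all(v > t for v in bag)
    assert (max(bag) < t) == all(v < t for v in bag)
    return True

random.seed(0)
all_hold = all(
    check([random.uniform(-5, 5) for _ in range(random.randint(1, 6))],
          random.uniform(-5, 5))
    for _ in range(1000)
)
```

A single numeric feature such as max therefore stands in for the whole family of threshold tests that a propositional learner may later choose from.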

In short we can conclude that on the level of summarisation (d_v = 1) aggregates 
can express many of the concepts used in FOL. They can even express concepts that 
are hard or impossible to express in FOL. The most important limitation of our choice 
of aggregate instances is the number of attributes involved: w_v ≤ 1. This restriction 
prevents the use of combinations of attributes, which would cause a combinatorial 
explosion of features [10]. 

5 The RollUp Algorithm 

With the basic operations provided in the previous sections we can now define a basic 
propositionalisation algorithm. The algorithm will traverse the data model graph and 
repeatedly use the summarisation operation to project data from one table onto an- 
other, until all information has been aggregated at the target table. Although this re- 
peated summarisation can be done in several ways, we will describe a basic algo- 
rithm, called RollUp. 

The RollUp algorithm performs a depth-first search (DFS) through the data model, 
up to a specified depth. Whenever the recursive algorithm reaches its maximum depth 
or a leaf in the graph, it will “roll up” the relevant table by summarising it on the 
parent in the DFS tree. Internal nodes in the tree will again be summarised after all its 
children have been summarised. This means that attributes considered deep in the tree 
may be aggregated multiple times. The process continues until all tables are summarised 
on the target table. In combination with a propositional learner we have an instance 
of Polka. The following pseudo code describes RollUp more formally: 

RollUp (Table T, Datamodel M, int d) 

    V := T; 
    if d > 0 
        for all associations A from T in M 
            W := RollUp(T.getTable(A), M, d-1); 
            S := Summarise(W, A); 
            V.add(S); 
    return V; 

The effect of RollUp is that each attribute appearing in a table other than the target 
table will appear several times in aggregated form in the resulting view. This multiple 
occurrence happens for two reasons. The first reason is that tables may occur multiple 
times in the DFS tree because they can be reached through multiple paths in the data- 
model. Each path will produce a different aggregation of the available attributes. The 
second reason is related to the choices of aggregate class at each summarisation along 
a path in the datamodel. This choice, and the fact that aggregates may be combined in 
longer paths produces multiple occurrences of an attribute per path. 
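
The recursion can be sketched on in-memory tables. The code below is an illustration under simplifying assumptions, not the RollUp module itself: tables are lists of dictionaries with an `id` column, associations are directed one-to-many triples (parent, child, foreign key), only count, min, max, sum and avg are generated, and aggregates over empty groups are simply left out (the NULL case). All table and column names are invented.

```python
def summarise(parent_rows, child_rows, fk, prefix):
    """Project child rows onto parent rows using count, min, max, sum and avg."""
    groups = {}
    for row in child_rows:
        groups.setdefault(row[fk], []).append(row)
    out = []
    for parent in parent_rows:
        group = groups.get(parent["id"], [])
        new = dict(parent)
        new[f"count_{prefix}"] = len(group)
        numeric = sorted({k for r in group for k, v in r.items()
                          if k not in ("id", fk) and isinstance(v, (int, float))})
        for attr in numeric:
            vals = [r[attr] for r in group if attr in r]
            new[f"min_{prefix}_{attr}"] = min(vals)
            new[f"max_{prefix}_{attr}"] = max(vals)
            new[f"sum_{prefix}_{attr}"] = sum(vals)
            new[f"avg_{prefix}_{attr}"] = sum(vals) / len(vals)
        out.append(new)
    return out

def roll_up(table, tables, associations, depth):
    """Depth-first roll-up of everything reachable from `table` within `depth` steps."""
    view = [dict(row) for row in tables[table]]                   # V := T
    if depth > 0:
        for parent, child, fk in associations:
            if parent == table:
                rolled = roll_up(child, tables, associations, depth - 1)   # W
                view = summarise(view, rolled, fk, child)                  # V.add(S)
    return view

tables = {
    "client": [{"id": 1}, {"id": 2}],
    "account": [{"id": 10, "client_id": 1}, {"id": 11, "client_id": 1},
                {"id": 12, "client_id": 2}],
    "transaction": [{"id": 100, "account_id": 10, "amount": 5.0},
                    {"id": 101, "account_id": 10, "amount": 7.0},
                    {"id": 102, "account_id": 11, "amount": 1.0}],
}
associations = [("client", "account", "client_id"),
                ("account", "transaction", "account_id")]

view = roll_up("client", tables, associations, depth=2)
shallow = roll_up("client", tables, associations, depth=1)
```

At depth 2 the transaction amount reaches the client table as an aggregate of aggregates (for example `sum_account_sum_transaction_amount`), which is the repeated aggregation described above; at depth 1 only the number of accounts is added.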

The variable-depth of the deepest feature is equal to the parameter d. Each feature 
corresponds to at most one attribute aggregated along a path of depth d_v. The 
clause-depth is therefore a linear function of the variable-depth. As each feature 
involves at most one attribute, and is aggregated along a path with no branches, the 
variable-width will always be either 0 or 1. This produces the following characteristics 
for RollUp. Lemmas 1 to 3 can then be used to characterise Polka instantiated with RollUp. 

Lemma 4. d_v(RollUp) = d 

Lemma 5. d_c(RollUp) = 2 · d_v(RollUp) + 1 

Lemma 6. w_v(RollUp) = 1 

6 Experiments 

In order to acquire empirical knowledge about the effectiveness of our approach, we 
have tested RollUp on three well-known multi-relational datasets. These datasets were 
chosen because they show a variety of datamodels that occur frequently in many 
multi-relational problems. They are Musk [3], Mutagenesis [11], and Financial [14]. 

Each dataset was loaded in the RDBMS Oracle. The data was modelled in UML 
using the multi-relational modelling tool Tolkien (see [9]) and subsequently translated 
to CDBL. Based on this declarative bias, the RollUp module produced one database 
view for each dataset, containing the propositionalised data. This was then taken as 
input for the common Machine Learning procedure C5.0. 

For quantitative comparison with other techniques, we have computed the average 
accuracy by leave-one-out cross-validation for Musk and Mutagenesis, and by 10-fold 
cross-validation for Financial. 
6.1 Musk 

The Musk database [3] describes molecules occurring in different conformations. 
Each molecule is either musk or non-musk and one of the conformations determines 
this property. Such a problem is known as a multiple-instance problem, and will be 
modelled by two tables molecule and conformation, joined by a one-to-many association. 
Conformation contains a molecule identifier plus 166 continuous features. 
Molecule just contains the identifier and the class. We have analysed two datasets, 
MuskSmall, containing 92 molecules and 476 conformations, and MuskLarge, containing 
102 molecules and 6598 conformations. The resulting table contains a total of 
674 features. 

Table 1 shows the results of RollUp compared to other, previously published re- 
sults. The performance of RollUp is comparable to Tilde, but below that of special- 
purpose algorithms. 



Table 1. Results on musk 

Algorithm              MuskSmall   MuskLarge 
Iterated-discrim APR   92.4%       89.2% 
GFS elim kde APR       91.3%       80.4% 
RollUp                 89.1%       77.5% 
Tilde                  87.0%       79.4% 
Back-propagation       75.0%       67.7% 
C4.5                   68.5%       58.8% 

6.2 Mutagenesis 

Similar to the Musk database, the Mutagenesis database describes molecules falling in 
two classes, mutagenic and non-mutagenic. However, this time structural information 
about the atoms and bonds that make up the compound is provided. As chemical 
compounds are essentially annotated graphs, this database is a good test-case for how 
well our approach deals with graph-data. The dataset we have analysed is known as 
the ‘regression-friendly’ dataset, and consists of 188 molecules. The database consists 
of 26 tables, of which three tables directly describe the graphical structure of the 
molecule (molecule, atom and bond). The remaining 23 tables describe the occur- 
rence of predefined functional groups, such as benzene rings. 

Four different experiments will be performed, using different settings, or so-called 
backgrounds. They will be referred to as experiment B1 to B4: 

• B1: the atoms in the molecule are given, as well as the bonds between them; the 
type of each bond is given as well as the element and type of each atom. 

• B2: as B1, but continuous values about the charge of atoms are added. 

• B3: as B2, but two continuous values describing each molecule are added. 

• B4: as B3, but knowledge about functional groups is added. 

The largest resulting table, for B4, contains 309 constructed features. 

Table 2 shows the results of RollUp compared to other, previously published results. 
Clearly RollUp outperforms the other methods on all backgrounds, except B4. 
Most surprisingly, RollUp already performs well on B1, whereas the ILP methods 
seem to benefit from the propositional information provided in B3. 



Table 2. Results on mutagenesis 

      Progol   FOIL   Tilde   RollUp 
B1    76%      61%    75%     86% 
B2    81%      61%    79%     85% 
B3    83%      83%    85%     89% 
B4    88%      82%    86%     84% 



Example. The following tree of the B3 experiment illustrates the use of aggregates 
for structural descriptions. 



CNT_BOND =< 26 
    PREDOMINANT_TYPE_ATOM [21 27] -> F 
    PREDOMINANT_TYPE_ATOM 22 -> F 
    PREDOMINANT_TYPE_ATOM 3 
        MAX_CHARGE_ATOM =< 0.0 
            PREDOMINANT_TYPE_BOND 7 -> F 
            PREDOMINANT_TYPE_BOND 1 -> T 
        MAX_CHARGE_ATOM > 0.0 -> F 
CNT_BOND > 26 
    LUMO =< -1.102 
        LOGP =< 6.26 -> T 
        LOGP > 6.26 -> F 
    LUMO > -1.102 -> F 



6.3 Financial 

Our third database is taken from the Discovery Challenge organised at PKDD ’99 and 
PKDD 2000 [14]. The database is based on data from a Czech bank. It describes the 
operations of 5369 clients holding 4500 accounts. The data is stored in 8 tables, 4 of 
which describe the usage of products, such as credit cards and loans. Three tables 
describe client and account information, and the remaining table contains demo- 
graphic information about 77 Czech districts. We have chosen the account table as 
our target table. Although we thus have 4500 examples, the dataset contains a total of 
1079680 records. Our aim was to determine the loan-quality of an account, that is the 
existence of a loan with status ‘finished good loan’ or ‘running good loan’. The result- 
ing table contains a total of 148 features. 

A near perfect score of 99.9% was achieved on the Financial dataset. Due to the 
great variety of problem definitions described in the literature, quantitative 
comparisons with previous results are impossible. Similar (descriptive) analyses of 
loan-quality, however, never produced the pattern responsible for RollUp's performance. 
The aggregation approach proved particularly successful on the large transaction 
table (1056320 records). This table has sometimes been left out of other experiments 
due to scalability problems. 
7 Discussion

The experimental results in the previous section demonstrate that our approach is at 
least competitive with existing multi-relational techniques, such as Progol and Tilde. 
Our approach has two major differences with these techniques, which may be the 
source of the good performance: the use of aggregates and the use of propositionalisa- 
tion. Let us consider the contribution of each of these in turn. 

Aggregates. There is an essential difference in the way a group of records is characterised
by FOL and by aggregates. FOL characterisations are based on the occurrence
of one or more records in the group with certain properties. Aggregates on the other
hand typically describe the group as a whole; each record has some influence on the 
value of the aggregate. The result of this difference is that FOL and aggregates pro- 
vide two unique feature-spaces to the learning procedure. Each feature-space has its 
advantages and disadvantages, and may be more or less suitable for the problem at 
hand. 

Although the feature-spaces produced by FOL and aggregates have entirely differ- 
ent characteristics, there is still some overlap. As was shown in section 4, some ag- 
gregates are very similar in behaviour to FOL expressions. The common features in 
the two spaces typically 

• select one or a few records in a group (min and <, max and >, count > 0 for 
some condition). 

• involve a single attribute: w_v ≤ 1

• have a relatively low variable-depth. 

If these properties hold, aggregate-based learning procedures will generally perform
better, as they have at their disposal the common selective aggregates, as well as the
complete aggregates such as sum and avg.
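
The overlap between the two feature-spaces can be illustrated with a small sketch. The Python fragment below is not part of RollUp; the function names and the sample values are hypothetical. It shows that a selective aggregate such as max, compared with a threshold, gives the same answer as an existential (FOL-style) test on the group, whereas a complete aggregate such as avg depends on every record.

```python
def exists_above(values, threshold):
    """FOL-style feature: is there at least one record above the threshold?"""
    return any(v > threshold for v in values)


def max_above(values, threshold):
    """Selective aggregate: compare the maximum of the group with the threshold."""
    return len(values) > 0 and max(values) > threshold


def avg_above(values, threshold):
    """Complete aggregate: every record influences the average."""
    return len(values) > 0 and sum(values) / len(values) > threshold


# Hypothetical group of records (e.g. the amounts of the transactions of one account).
group = [120.0, 35.5, 980.0]

# The selective aggregate and the existential test agree for every threshold tried ...
agree = all(exists_above(group, t) == max_above(group, t) for t in (0, 100, 500, 1000))
# ... while the average answers a different question about the same group.
differs = exists_above(group, 500) != avg_above(group, 500)
```

The same correspondence holds for min combined with <, and for a count of records satisfying a condition compared with zero.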

Datamodels with a low variable-depth are quite common in database design, and
are called star-shaped (d_v = 1) or snowflake schemata (d_v > 1). The Musk dataset is
the simplest example of a star-shaped model. The datamodel of Mutagenesis consists
for a large part of a star-shaped model, and Financial is essentially a snowflake
schema. Many real-world datasets described in the literature as ILP applications
essentially have such a manageable structure. Moreover, results on these datasets
frequently exhibit the extra condition of w_v ≤ 1. Some illustrative examples are given in
[4, 13].

Propositionalisation. According to lemma 2, Polka has the ability to discover patterns
that have a bigger clause-depth than either of its steps has. This is demonstrated
by the experiments with our particular instance of Polka. RollUp produces variable-deep
and thus clause-deep features. These clause-deep features are combined in the
decision tree. Some leaves represent very clause-deep patterns, even though their support
is still sufficient. This is an advantage of Polka (propositionalisation + propositional
learning) over multi-relational algorithms in general.

Next to advantages related to expressivity, there are more practical reasons for using
Polka. Once the propositionalisation stage is finished, a large part of the computationally
expensive work is done, and the derived view can be analysed multiple times.
This not only provides greater efficiency, but also gives the analyst more flexibility in
choosing the right modelling technique from a large range of well-developed, commercially
available tools. The analyst can vary the style of analysis (trees, rules,
neural, instance-based) as well as the paradigm (classification, regression).
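
As a rough illustration of why the derived view can be reused, the following Python sketch summarises a one-to-many table onto a target table with a fixed set of aggregates. It is only a sketch under stated assumptions, not the RollUp implementation: the function `summarise`, the table layout, and the column names (`account_id`, `amount`) are hypothetical, and only a single numeric attribute is aggregated.

```python
from collections import defaultdict


def summarise(target_rows, child_rows, key, attribute, prefix):
    """Add count/min/max/sum/avg of child_rows[attribute] to each target row."""
    groups = defaultdict(list)
    for row in child_rows:
        groups[row[key]].append(row[attribute])

    view = []
    for target in target_rows:
        values = groups.get(target[key], [])
        features = dict(target)
        features["CNT_" + prefix] = len(values)
        features["MIN_" + prefix] = min(values) if values else None
        features["MAX_" + prefix] = max(values) if values else None
        features["SUM_" + prefix] = sum(values)
        features["AVG_" + prefix] = sum(values) / len(values) if values else None
        view.append(features)
    return view


# Hypothetical target table and one-to-many child table.
accounts = [{"account_id": 1}, {"account_id": 2}, {"account_id": 3}]
transactions = [
    {"account_id": 1, "amount": 100.0},
    {"account_id": 1, "amount": 300.0},
    {"account_id": 2, "amount": 50.0},
]
view = summarise(accounts, transactions, "account_id", "amount", "AMOUNT_TRANS")
```

The resulting `view` has one row per target record and can be handed to any propositional learner as often as needed without touching the child table again.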

8 Conclusion 

We have presented a method that uses aggregates to propositionalise a multi- 
relational database, such that the resulting view can be analysed by existing proposi- 
tional methods. The method uses information from the datamodel to guide a process 
of repeated summarisation of tables on other tables. The method has shown good 
performance on three well-known datasets, both in terms of accuracy as well as in 
terms of speed and flexibility. 

The experimental findings are supported by theoretical results, which indicate the 
strength of this approach on so-called star-shaped or snowflake datamodels. We have 
also given evidence for why propositionalisation approaches in general may outper- 
form ILP or MRDM systems, as was suggested before in the literature [4, 12]. 
Abstract. Several algorithms that generate the set of all formal concepts and
graphs of line (Hasse) diagrams of concept lattices are considered. Some modi- 
fications of well-known algorithms are proposed. Algorithmic complexity of 
the algorithms is studied both theoretically (in the worst case) and experimen- 
tally. Conditions of preferable use of some algorithms are given in terms of 
density/sparsity of underlying formal contexts. 



1 Introduction 

Concept lattices proved to be a useful tool for machine learning and knowledge dis- 
covery in databases [3, 6, 9, 19, 22-24]. The problem of generating the set of all con- 
cepts and the diagram graph of the concept lattice is extensively studied in the litera- 
ture [2-5, 7, 10, 11, 13, 16, 18-20]. It is known that the number of concepts can be 
exponential in the size of the input context (e.g., when the lattice is a Boolean one) 
and the problem of determining this number is #P-complete [15]. Therefore, from the 
standpoint of the worst-case complexity, an algorithm can be considered optimal if it 
generates the concept lattice in time and space linear in the number of all concepts 
(modulo a factor polynomial in the input size). On the other hand, “dense” contexts,
which realize the worst case by bringing about an exponential number of concepts, may
not occur often in practice. Moreover, various implementation issues, such as the dimension
of a “typical” context, specifics of the operating system used, and so on, may
be crucial for the practical evaluation of algorithms. In this article, we consider, both
theoretically and experimentally, several algorithms that generate concept lattices for 
clearly specified data sets. In most cases, it was possible to improve the original ver- 
sions of the algorithms. We present modifications of some algorithms and indicate 
conditions when some of them perform better than the others. Only a few known al- 
gorithms generating the concept set construct the graph of the line diagram. We modi- 
fied some algorithms so that they can construct graphs of line diagrams. 

The first comparative study of four algorithms constructing the concept set and the
graph of the line diagram can be found in [13]. Descriptions of the algorithms are
sometimes buggy, and the description of the experimental tests lacks any information
about the data used for the tests. The fact that the choice of an algorithm should depend on
the input data is not accounted for. Besides, only one of the algorithms considered in [13],
namely that of Bordat [2], constructs the graph of the line diagram; thus, it is hard to
compare its performance with that of the other algorithms.

A much more elaborate review can be found in [11] (where another algorithm is 
proposed). The authors of [11] consider algorithms that generate the graph of the line 
diagram. Algorithms that were not originally designed for this purpose are extended 
by the authors. Such extensions are not always efficient; for example, the time complexity
of the version of the Ganter algorithm (called Ganter-Allaoui) increases dramatically
with the growth of the context size. This drawback can be eliminated by
using binary search in the list produced by the original Ganter algorithm. Tests were
conducted only for contexts with a small number of attributes per object as compared to
the number of all attributes. Our experiments (we consider more algorithms) also
show that the algorithm proposed in [11] performs better on such contexts than the
others do [17]. However, for “dense” contexts, this algorithm performs worse than
some other algorithms (details can be found in [17]).

The paper is organized as follows. In Section 2, we give main definitions and an 
example. In Section 3, we give a survey of batch and incremental algorithms for con- 
structing concept lattices and analyze their worst-case complexity. In Section 4, we 
consider results of experimental comparison of the algorithms. 



2 Main Definitions 

First, we introduce standard FCA notation [8], which will be used throughout the paper.

A (formal) context is a triple of sets (G, M, I), where G is called a set of objects, M
is called a set of attributes, and $I \subseteq G \times M$. For $A \subseteq G$ and $B \subseteq M$:
$A' = \{m \in M \mid \forall g \in A\ (gIm)\}$, $B' = \{g \in G \mid \forall m \in B\ (gIm)\}$.
The operator $''$ is a closure operator, i.e., it is monotone, extensive, and idempotent.
A (formal) concept of a formal context (G, M, I) is a
pair (A, B), where $A \subseteq G$, $B \subseteq M$, $A' = B$, and $B' = A$. The set A is called the (formal)
extent and B the (formal) intent of the concept (A, B). For a context (G, M, I), a concept
X = (A, B) is less general than or equal to a concept Y = (C, D) (or $X \leq Y$) if $A \subseteq C$
or, equivalently, $D \subseteq B$. Suppose that X and Y are concepts, $X \leq Y$, and there is no
concept Z such that $Z \neq X$, $Z \neq Y$, $X \leq Z \leq Y$. Then X is called a lower neighbor of Y,
and Y is called an upper neighbor of X. This relationship is denoted by $X \prec Y$. The set
of all concepts of a formal context forms a complete lattice L [8]. The graph of the
line diagram of a concept lattice (or simply a diagram graph) is the directed graph of
the relation $\prec$. The line diagram is a plane embedding of a diagram graph where each
concept vertex is always drawn above all its lower neighbors (thus, the arrows on the
arcs become superfluous and can be omitted).
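
The derivation operators and the definition of a concept translate directly into code. The following Python sketch is only an illustration: the four-object context `CONTEXT` is a hypothetical toy example chosen for this sketch, and the function names are not taken from the paper.

```python
# Toy context (hypothetical, for illustration only): object -> set of attributes.
CONTEXT = {
    1: {"b", "d"},
    2: {"b", "c"},
    3: {"a", "c"},
    4: {"a", "c", "d"},
}
ATTRIBUTES = {"a", "b", "c", "d"}


def intent_of(objects):
    """A': the attributes shared by every object in `objects`."""
    result = set(ATTRIBUTES)
    for g in objects:
        result &= CONTEXT[g]
    return result


def extent_of(attributes):
    """B': the objects that have every attribute in `attributes`."""
    return {g for g, attrs in CONTEXT.items() if set(attributes) <= attrs}


def closure(objects):
    """A'': the smallest extent that contains `objects`."""
    return extent_of(intent_of(objects))


def is_concept(extent, intent):
    """(A, B) is a formal concept iff A' = B and B' = A."""
    return intent_of(extent) == set(intent) and extent_of(intent) == set(extent)
```

In this toy context, `({3, 4}, {"a", "c"})` is a concept, whereas `({3}, {"a", "c"})` is not, because the closure of `{3}` is `{3, 4}`.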



Example 1. Below we present a formal context with some elementary geometric
figures and its line diagram. We shall sometimes omit braces and write, e.g., 12
instead of {1, 2}.

G \ M: a = 4 vertices; b = 3 vertices; c = has a right angle; d = all sides are equal
[cross-table rows not recoverable]

Fig. 1. A formal context




Fig. 2. The line diagram for the context from Fig. 1 

Data structures that realize concept sets and diagram graphs of concept lattices are
of great importance. Since their sizes can be exponentially large w.r.t. the input size,
some of their natural representations are not polynomially equivalent, as is the case
for graphs. For example, the size of the incidence matrix of a diagram graph is quadratic
w.r.t. the size of the incidence list of the diagram graph and thus cannot be reduced
to the latter in time polynomial w.r.t. the input. Moreover, some important operations,
such as find_concept, are performed for some representations (spanning
trees [2, 10], ordered lists [7], CbO trees [16], 2-3 trees [1]) in polynomial time, but
for some other representations (unordered lists) they can be performed only in exponential
time. A representation of a concept lattice can be considered reasonable if its
size cannot be exponentially compressed w.r.t. the input and it allows the search for a
particular concept in time polynomial in the input.

All the algorithms can be divided into two categories: incremental algorithms [3, 5,
11, 20], which, at the ith step, produce the concept set or the diagram graph for the first i
objects of the context, and batch ones, which build the concept set or its diagram
graph for the whole context from scratch [2, 4, 7, 16, 18, 25]. Besides, any algorithm
typically adheres to one of two strategies: top-down (from the maximal extent to
the minimal one) or bottom-up (from the minimal extent to the maximal one).

In many cases, we attempted to improve the efficiency of the original algorithms 
presented below. Only some of the original versions of the algorithms construct the 
diagram graph [2, 11, 18, 21]; it turned out that the other algorithms could be ex- 
tended to construct the diagram graph within the same worst-case time complexity 
bounds. Some algorithms are given the name of their authors. 

In the next section, we will discuss worst-case complexity bounds of the consid- 
ered algorithms. Due to the possibility of the exponential output of the algorithms, it 
is reasonable to estimate their complexity not only in terms of the input and output 
size, but also in terms of (cumulative) delay. Recall that an algorithm for listing a 
family of combinatorial structures is said to have polynomial delay [14] if it executes 
at most polynomially many computation steps before either outputting each next 
structure or halting. Note that the worst-case complexity of an algorithm with poly- 
nomial delay is a linear function of the output size modulo some factor polynomial in 
the input size. A weaker notion of efficiency of listing algorithms was proposed in 
[12]. An algorithm is said to have a cumulative delay d if it is the case that at any 
point of time in any execution of the algorithm with any input p the total number of 
computation steps that have been executed is at most d(p) plus the product of d(p) and 
the number of structures that have been output so far. If d(p) can be bounded by a 
polynomial of p, the algorithm is said to have a polynomial cumulative delay. 
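
The relation between the two notions can be written compactly. In the sketch below, the symbols T(k) (the number of computation steps executed up to the output of the k-th structure), q (a polynomial bound on the delay), s (the number of steps executed so far), and |L| (the number of structures output in total) are notation introduced here for illustration only.

```latex
\begin{align*}
\text{polynomial delay } q:&\quad T(1) \le q(p), \qquad
    T(k+1) - T(k) \le q(p) \quad (1 \le k < |L|),\\
\text{cumulative delay } d:&\quad s \le d(p) + d(p)\,k
    \quad \text{after } s \text{ steps and } k \text{ outputs},\\
\text{hence, with delay } q:&\quad s \le T(k) + q(p) \le q(p)\,k + q(p).
\end{align*}
```

Under these assumptions an algorithm with polynomial delay q also has cumulative delay q, and its total running time is at most q(p)(|L| + 1), which is the linear dependence on the output size noted above.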



3 Algorithms: A Survey 

Here we give a brief version of the survey found in [17]. First, we consider batch algorithms.
The top-down algorithm MI-tree from [25] generates the concept set, but
does not build the diagram graph. In MI-tree, every new concept is searched for in
the set of all concepts generated so far. The top-down algorithm of Bordat [2] uses a
tree (a “trie,” cf. [1]) for fast storing and retrieval of concepts. Our version of this algorithm
uses a technique that requires $O(|M|)$ time to determine, without any search, whether a
concept is generated for the first time. An auxiliary tree, which is actually a
spanning tree of the diagram graph, is used to construct the latter. Ch((A, B)) is the set
of children of the concept (A, B) in this tree; it consists of the lower neighbors of (A, B)
generated for the first time.



Bordat

0. L := ∅
1. Process((G, G'), G')
2. L is the concept set.

Process((A, B), C)

1. L := L ∪ {(A, B)}
2. LN := LowerNeighbors((A, B))
3. For each (D, E) ∈ LN
   3.1. If C ∩ E = B
        3.1.1. C := C ∪ E
        3.1.2. Process((D, E), C)
        3.1.3. Ch((A, B)) := Ch((A, B)) ∪ {(D, E)}
   3.2. Else
        3.2.1. Find((G, G'), (D, E))
   3.3. (A, B) is an upper neighbor of (D, E)




The full version of the algorithm can be found in [17]. The time complexity of the
algorithm is $O(|G||M|^2|L|)$; its polynomial delay is $O(|G||M|^2)$.
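
The LowerNeighbors step and the top-down construction of the diagram graph can be sketched as follows. This is a simplified illustration, not the version of the algorithm from [17]: the toy context is hypothetical, lower neighbors are obtained as the maximal sets among the intersections of the extent with single attribute extents, and a Python set of already generated concepts stands in for the tree and the first-generation test.

```python
# Toy context (hypothetical): object -> set of attributes.
CONTEXT = {1: {"b", "d"}, 2: {"b", "c"}, 3: {"a", "c"}, 4: {"a", "c", "d"}}
ATTRIBUTES = frozenset("abcd")


def intent_of(objects):
    """Attributes shared by all given objects (all attributes for the empty set)."""
    result = set(ATTRIBUTES)
    for g in objects:
        result &= CONTEXT[g]
    return frozenset(result)


def extent_of(attributes):
    """Objects possessing all given attributes."""
    return frozenset(g for g, attrs in CONTEXT.items() if attributes <= attrs)


def lower_neighbors(extent, intent):
    """Lower neighbors of (extent, intent): maximal sets extent & {m}' with m not in intent."""
    candidates = {extent & extent_of(frozenset([m])) for m in ATTRIBUTES - intent}
    maximal = [x for x in candidates if not any(x < y for y in candidates)]
    return [(x, intent_of(x)) for x in maximal]


def diagram_graph():
    """Top-down traversal; a set of seen concepts replaces the concept tree."""
    top = (extent_of(frozenset()), intent_of(CONTEXT.keys()))
    seen, edges, stack = {top}, set(), [top]
    while stack:
        concept = stack.pop()
        for child in lower_neighbors(*concept):
            edges.add((concept, child))  # concept is an upper neighbor of child
            if child not in seen:
                seen.add(child)
                stack.append(child)
    return seen, edges


concepts, edges = diagram_graph()
```

For the toy context this yields 9 concepts and 13 arcs of the diagram graph.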

The well-known algorithm proposed by Ganter computes closures for only some
subsets of G and uses an efficient canonicity test, which does not address the list of
generated concepts. The subsets are considered in lexicographic order [7, 8]. The
Ganter algorithm has polynomial delay $O(|G|^2|M|)$. Its complexity is $O(|G|^2|M||L|)$.
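
A minimal sketch of this scheme is given below, assuming the usual formulation of the next-closure step on object sets. The toy context and the function names are hypothetical and serve only as an illustration; note that the canonicity test inspects the candidate closure alone and never consults the list of extents generated so far.

```python
# Toy context (hypothetical): object -> set of attributes.
CONTEXT = {1: {"b", "d"}, 2: {"b", "c"}, 3: {"a", "c"}, 4: {"a", "c", "d"}}
ATTRIBUTES = frozenset("abcd")
OBJECTS = sorted(CONTEXT)


def closure(objects):
    """A'' for a set of objects A."""
    intent = set(ATTRIBUTES)
    for g in objects:
        intent &= CONTEXT[g]
    return frozenset(g for g in OBJECTS if intent <= CONTEXT[g])


def next_extent(current):
    """Lexicographically next closed object set after `current`, or None."""
    prefix = set(current)
    for g in reversed(OBJECTS):
        if g in prefix:
            prefix.discard(g)
        else:
            candidate = closure(prefix | {g})
            # Canonicity test: the closure must not add any object smaller than g.
            if all(h >= g for h in candidate - prefix):
                return candidate
    return None


def all_extents():
    """Enumerate all extents, starting from the closure of the empty set."""
    extents = []
    current = closure(set())
    while current is not None:
        extents.append(current)
        current = next_extent(current)
    return extents


extents = all_extents()
```

Each extent is produced exactly once, so no search for duplicates is needed.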

The Close by One (CbO) algorithm uses a notion of canonicity similar to that of
Ganter and a similar method for selecting subsets. It employs an intermediate structure
that helps to compute closures more efficiently using the generated concepts.
Objects are assigned numbers; $g < h$ holds whenever the number of g is smaller than
that of h. The CbO algorithm obtains a new concept from a concept (A, B) generated
at a previous step by intersecting B with the intent of an object g that does not belong
to A. The generation is considered canonical if the intersection is not contained in the
intent of any object from $G \setminus A$ with a smaller number than that of g. The algorithm
repeatedly calls Process({g}, g, ({g}'', {g}')) for each object g.

Process(A, g, (C, D))    {C = A'', D = A'}

1. If {h | h ∈ C \ A & h < g} = ∅
   1.1. L := L ∪ {(C, D)}
   1.2. For each f ∈ {h | h ∈ G \ C & g < h}
        1.2.1. Z := C ∪ {f}
        1.2.2. Y := D ∩ {f}'
        1.2.3. X := Y' (= Z ∪ {h | h ∈ G \ Z & Y ⊆ {h}'})
        1.2.4. Process(Z, f, (X, Y))

The CbO algorithm has polynomial delay $O(|G|^3|M|)$ and complexity $O(|G|^2|M||L|)$.
To construct the diagram graph with the CbO algorithm, we use a tree, which is not a
spanning tree of the diagram graph, but it agrees with the concept order.
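
The Process procedure can be transcribed almost literally. The Python sketch below assumes a hypothetical toy context and omits the intermediate structure and the tree used for the diagram graph; it only generates the concept list.

```python
# Toy context (hypothetical): object -> set of attributes.
CONTEXT = {1: {"b", "d"}, 2: {"b", "c"}, 3: {"a", "c"}, 4: {"a", "c", "d"}}
OBJECTS = sorted(CONTEXT)


def extent_of(attributes):
    """Objects possessing all given attributes."""
    return frozenset(g for g in OBJECTS if attributes <= CONTEXT[g])


def close_by_one():
    concepts = []

    def process(generator, g, extent, intent):
        # Canonicity test: the closure must not contain a new object smaller than g.
        if any(h < g for h in extent - generator):
            return
        concepts.append((extent, intent))
        for f in OBJECTS:
            if f > g and f not in extent:
                z = extent | {f}          # new generating set
                y = intent & CONTEXT[f]   # new intent
                x = extent_of(y)          # new extent (closure of z)
                process(z, f, x, y)

    for g in OBJECTS:
        intent = frozenset(CONTEXT[g])
        process(frozenset([g]), g, extent_of(intent), intent)
    return concepts


concepts = close_by_one()
```

Since every call starts from a nonempty object set, this sketch produces only concepts with nonempty extents; for the toy context these are 8 concepts, and the concept with the empty extent would have to be added separately.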

The idea of the bottom-up algorithm in [18] is to generate the bottom concept and
then, for each concept that is generated for the first time, generate all its upper
neighbors. Lindig uses a tree of concepts that allows one to check whether some concept
was generated earlier. The description of the tree is not detailed in [18], but it
seems to be the spanning tree of the inverted diagram graph (i.e., with the root at the
bottom of the diagram), similar to the tree from Bordat. Finding a concept in such a
tree takes $O(|G||M|)$ time. In fact, this algorithm may be regarded as a bottom-up version
of the Bordat algorithm. The time complexity of the algorithm is $O(|G|^2|M||L|)$.
Its polynomial delay is $O(|G|^2|M|)$.

The AI-tree [25] and Chein [4] algorithms operate with extent-intent pairs and
generate each new concept intent as the intersection of intents of two generated concepts.
At every iteration step of the Chein algorithm, a new layer of concepts is created
by intersecting pairs of concept intents from the current layer and the new intent
is searched for in the new layer. We introduced several modifications [17] that made
it possible to improve the performance of the algorithm. The time complexity of the
modified algorithm is $O(|G|^3|M||L|)$; its polynomial delay is $O(|G|^3|M|)$.

Now we consider incremental algorithms, which cannot have polynomial delay. 
Nevertheless, all algorithms below have cumulative polynomial delay. 

L. Nourine [21] proposes an algorithm for the construction of the lattice using a
lexicographic tree with the best known worst-case complexity bound $O((|G| + |M|)|G||L|)$.
Edges of the tree are labeled with attributes, and nodes are labeled with
concepts whose intents consist of the attributes that label the edges leading from the
root to the node. Clearly, some nodes do not have labels. First, the tree is constructed
incrementally (similar to the Norris algorithm presented below). An intent of a new
concept C is created by intersecting an object intent g' and the intent of a concept D
created earlier, and the extent of C is formed by adding g to the extent of D; this takes
$O(|M| + |G|)$ time. A new concept is searched for in the tree using the intent of the
concept as the key; this search requires $O(|M|)$ time. When the tree is created, it is
used to construct the diagram graph. For each concept C, its parents are sought as
follows. A counter, initialized to zero at the beginning of the process, is kept for every
concept. For each object, the intersection of its intent and the concept intent is produced
in $O(|M|)$ time. A concept D with the intent equal to this intersection is found in
the tree in $O(|M|)$ time and the value of its counter is increased; if the counter is equal to
the difference between the cardinalities of the concepts C and D (i.e., the intersection
of the intent of C and the intent of any object from D outside C is equal to the intent
of D), the concept D is a parent of C.

The algorithm proposed by E. Norris [20] can be considered as an incremental ana- 
logue of the CbO algorithm. The concept tree (which is useful only for diagram con- 
struction) can be built as follows: first, there is only the dummy root; examine objects 
from G and for each concept of the tree check whether the object under consideration 
has all the attributes of the concept intent; if it does, add it to the extent; otherwise, 
form a new node and declare it a child node of the current one; the extent of the corre- 
sponding concept equals the extent of the parent node plus the object being examined; 
the intent is the intersection of this object intent and the parent intent; next, test the 
new node for canonicity; if the test fails, remove it from the tree. The original version
of the Norris algorithm from [20] does not construct the diagram graph. In this
case, Norris is preferable to CbO, as the latter has to remember how the last concept 
was generated; this involves additional storage resources, as well as time expenses. 
The Norris algorithm does not maintain any auxiliary structure. Besides, the closure 
of an object set is never computed explicitly. 
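
The incremental principle can be illustrated with a short Python sketch. It is not the Norris algorithm itself: instead of a canonicity test, it uses a dictionary keyed by intent to detect concepts that already exist, and the toy context and the function names are hypothetical. As in the description above, an object is added to the extent of every concept whose intent it satisfies, new intents arise as intersections with the object intent, and no closure of an object set is computed.

```python
def add_object(concepts, g, g_intent):
    """Return the intent -> extent map after adding object g with intent g_intent."""
    g_intent = frozenset(g_intent)
    updated = {intent: set(extent) for intent, extent in concepts.items()}
    for intent, extent in concepts.items():
        meet = intent & g_intent      # equals intent if g has all its attributes
        candidate = extent | {g}
        # Keep the largest extent among all concepts producing the same intersection.
        if len(candidate) > len(updated.get(meet, ())):
            updated[meet] = candidate
    return updated


def build_incrementally(context, attributes):
    """Start from the concept of the empty object set and add objects one by one."""
    concepts = {frozenset(attributes): set()}
    for g in sorted(context):
        concepts = add_object(concepts, g, context[g])
    return concepts


# Toy context (hypothetical): object -> set of attributes.
CONTEXT = {1: {"b", "d"}, 2: {"b", "c"}, 3: {"a", "c"}, 4: {"a", "c", "d"}}
lattice = build_incrementally(CONTEXT, "abcd")
```

After the i-th call of `add_object`, the dictionary holds the concept set of the context restricted to the first i objects.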

The algorithm proposed by Godin [11] has the worst-case time complexity quadratic
in the number of concepts. This algorithm is based on the use of an efficiently
computable hash function (which is actually the cardinality of an intent) defined on
the set of concepts.

C. Dowling proposed an incremental algorithm for computing knowledge spaces
[5]. A dual formulation of the algorithm allows generation of formal concepts. The
worst-case complexity of the algorithm is $O(|M||G|^2|L|)$, but the constants in this upper
bound are large and, in practice, the algorithm performs worse than other algorithms.




Algorithms for the Construction of Concept Lattices and Their Diagram Graphs 






4 Results of Experimental Tests 

The algorithms were implemented in C++. The tests were run on a Pentium II-300
computer with 256 MB RAM. Here, we present a number of charts that show how the
execution time of the algorithms depends on various parameters. More charts can be 
found in [17]. 

Fig. 3. Concept set: $|M| = 100$; $|g'| = 4$

Fig. 4. Diagram graph: $|M| = 100$; $|g'| = 4$

Fig. 6. Diagram graph: $|M| = 100$; $|g'| = 25$

The Godin algorithm (and GodinEx, which is the version of the Godin algorithm
using the cardinality of extents for the hash function) is a good choice in the case of a
sparse context. However, when contexts become denser, its performance decreases
dramatically. The Bordat algorithm seems most suitable for large contexts, especially
if it is necessary to build the diagram graph. When $|G|$ is small, the Bordat algorithm
runs several times slower than other algorithms, but, as $|G|$ grows, the difference between
Bordat and other algorithms becomes smaller, and, in many cases, Bordat
finally turns out to be the leader. For large and dense contexts, the fastest algorithms
are bottom-up canonicity-based algorithms (Norris, Ganter, CbO).

Fig. 7. Concept set: $|M| = 100$; $|g'| = 50$

Fig. 8. Diagram graph: $|M| = 100$; $|g'| = 50$

It should be noted that the Nourine algorithm with the best worst-case time complexity
did not show the best performance in our experiments: even when contexts of
the form $(G, G, \neq)$ were processed (which corresponds to the worst case of a Boolean
concept lattice), it was inferior to the Norris algorithm. This can probably be attributed
to the fact that we represent attribute sets by bit strings, which allows very
efficient implementation of set-theoretical operations (32 attributes per one processor
cycle), whereas when searching in the Nourine-style lexicographic tree, one still has to
consider each attribute labeling the edges individually.
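
The bit-string representation can be sketched in a few lines. The experiments used C++, where one machine word holds 32 attributes; the Python fragment below is only an analogue using arbitrary-precision integers, and the attribute names and helper functions are hypothetical.

```python
ATTRIBUTES = ["a", "b", "c", "d"]
BIT = {m: 1 << i for i, m in enumerate(ATTRIBUTES)}


def encode(attrs):
    """Attribute set -> integer bit mask."""
    mask = 0
    for m in attrs:
        mask |= BIT[m]
    return mask


def decode(mask):
    """Integer bit mask -> attribute set."""
    return {m for m in ATTRIBUTES if mask & BIT[m]}


def is_subset(x, y):
    """x is a subset of y iff intersecting with y leaves x unchanged."""
    return (x & y) == x


ac, acd, bd = encode("ac"), encode("acd"), encode("bd")
common = ac & acd  # intersection of two intents in a single operation
```

Intersection, union, and the subset test each reduce to one bitwise operation per machine word, in contrast to the attribute-by-attribute descent required in a lexicographic tree.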
Figures 9-10 show the execution time for the contexts of the form $(G, G, \neq)$ with
$2^{|G|}$ concepts.
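
That a context in which every object has all attributes except the one with its own label yields a Boolean lattice can be checked by brute force for small sizes. The Python sketch below is only an illustration with hypothetical function names; it enumerates all object subsets and is therefore exponential by construction.

```python
from itertools import combinations


def boolean_context(n):
    """Context (G, G, !=) on n objects: object g has every attribute except g."""
    return {g: {m for m in range(n) if m != g} for g in range(n)}


def count_concepts(context, attributes):
    """Count distinct extents by closing every subset of objects."""
    objects = list(context)
    extents = set()
    for size in range(len(objects) + 1):
        for subset in combinations(objects, size):
            intent = set(attributes)
            for g in subset:
                intent &= context[g]
            extents.add(frozenset(g for g in objects if intent <= context[g]))
    return len(extents)


sizes = {n: count_concepts(boolean_context(n), range(n)) for n in range(1, 6)}
```

Every subset of objects is closed in such a context, so the number of concepts doubles with each additional object.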

Fig. 10. Diagram graph: contexts of the form $(G, G, \neq)$

5 Conclusion

In this work, we attempted to compare some well-known algorithms for constructing
concept lattices and their diagram graphs. A new algorithm was proposed in [22]
quite recently, so we could not include it in our experiments. Its worst-case time complexity
is not better than that of the algorithms described above, but the authors report
good practical performance for databases with a very large number of objects. Comparing
the performance of this algorithm with those considered above and testing the
algorithms on large databases, including “classical” ones, will be the subject of
further work. We can also mention works [3, 19] where algorithms were applied for
learning and data analysis, e.g., in [19] a Bordat-type algorithm was used. The description
of the algorithm in [3] does not give details about the test for uniqueness of a
generated concept, i.e., whether it is already in the list. As we have seen, this test is
crucial for the efficiency of an algorithm.

The choice of the algorithm for construction of the concept set and diagram graph 
should be based on the properties of the input data. The general rule is as follows: the 
Godin algorithm should be used for small and sparse contexts; for dense contexts, the 
algorithms based on canonicity tests, linear in the number of output concepts, such as 
Close by One, Norris, and Ganter, should be used. The Bordat algorithm performs well on
contexts of average density, especially when the diagram graph is to be constructed.

As mentioned above, the experimental comparison of execution times of algo- 
rithms is implementation-dependent. To reduce this dependence, we implemented a 
program that made it possible to compare algorithms not only in the execution time, 
but also in the number of operations performed. Such a comparison is both more reliable
and more helpful, as it allows choosing an algorithm based on the computational
complexity of the operations in a particular implementation.
Abstract. A large amount of available information does not necessarily imply
that induction algorithms must use all this information. Samples often provide 
the same accuracy with less computational cost. We propose several effective 
techniques based on the idea of progressive sampling, in which progressively larger
samples are used for training as long as model accuracy improves. Our sampling
procedures combine all the models constructed on previously considered
data samples. In addition to random sampling, controllable sampling based on 
the boosting algorithm is proposed, where the models are combined using a 
weighted voting. To improve model accuracy, an effective pruning technique 
for inaccurate models is also employed. Finally, a novel sampling procedure for 
spatial data domains is proposed, where the data examples are drawn not only 
according to the performance of previous models, but also according to the spa- 
tial correlation of data. Experiments performed on several data sets showed that 
the proposed sampling procedures outperformed standard progressive sampling 
in both the achieved accuracy and the level of data reduction. 



1 Introduction 

Many existing data analysis algorithms require all the data to be resident in main
memory, which is clearly untenable for many of today's large databases. Even fast data
mining algorithms designed to run in main memory with linear asymptotic time
may be prohibitively slow when data is stored on disk, due to the many orders of
magnitude difference between main and secondary memory retrieval time.

While data mining methods are faster when used on smaller data sets, the demand 
for accurate models often requires the use of large data sets that allow algorithms to 
discover complex structure and make accurate parameter estimates. Therefore, one of 
the most important data mining problems is to determine a reasonable upper bound of
the data set size needed for building a sufficiently accurate model. Oates and Jensen [1]
found that increasing the amount of data used to build a model often results in a linear 
increase in model size, even when additional complexity causes no significant increase 
in model accuracy. Despite the promise of the better parameter estimation, models 
built with large amounts of data are often needlessly complex and cumbersome. 

Data reduction can also be extremely helpful for data mining from very large distributed
databases. In the contemporary data mining community, the majority of the
work on learning in a distributed environment considers only two possibilities: moving
all data into a centralized location for further processing, or leaving all data in
place and producing local predictive models, which are later moved and combined via
one of the standard machine learning methods [2]. With the emergence of new high-cost
networks and huge amounts of collected data, the former approach may be too
expensive, while the latter may be too inaccurate. Therefore, reducing the size of databases by
several orders of magnitude without loss of extractable information could speed
up the data transfer for more efficient and more accurate centralized learning.

In this paper we propose a novel technique for data reduction based on the idea of 
progressive sampling [3]. Progressive sampling starts with a small sample in an initial 
iteration and uses progressively larger ones in subsequent iterations until model accuracy
no longer improves. As a result, a near-optimal minimal size of the data set
needed for efficiently learning an acceptably accurate model is identified. Instead of
constructing a single predictor on the identified data set, our approach attempts to reuse
the most accurate and sufficiently diverse classifiers built in sampling iterations and to 
combine their predictions. In order to further improve achieved prediction accuracy, 
we propose a weighted sampling, based on a boosting technique [4], where the predic- 
tion models in subsequent iterations are built on those examples on which the previous 
predictor had poor performance. Similar techniques of active or controllable sampling 
are related to windowing [5], wherein subsequent sampling chooses training instances 
for which the current model makes the largest errors. However, simple active sampling
is notoriously ill-behaved on noisy data, since subsequent samples contain increasing
amounts of noise and performance often decreases as sampling progresses [6].

In addition, both the number and the size of spatial databases are rapidly growing, 
because huge amounts of data have been collected in various GIS applications ranging 
from remote sensing and satellite telemetry systems, to computer cartography and 
environmental planning. Therefore the data reduction of very large spatial databases is 
of fundamental importance for efficient spatial data analysis. Hence, in this paper we 
also propose a method for efficient progressive sampling of spatial databases, where
the sampling procedure is controlled not only by the accuracy of previous prediction
models but also by considering spatially correlated data points. In our approach,
data points that are highly spatially correlated are not likely to be sampled together in
the same sample, since they carry less useful information than two non-correlated
data points. The objective of this approach is to further reduce the size of a spatial data
set and to allow more efficient learning in such domains.

Applied to several very large data sets, the proposed sampling methods indicate that
both the general-purpose and the spatial progressive sampling techniques can learn
faster than standard progressive sampling [3], and can also outperform standard
progressive sampling in the achieved prediction accuracy.



2 Progressive Sampling 

Given a data set with N examples, our goal is to determine the minimal sample size for
which a sufficiently accurate prediction model can be learned. A modification of
geometric progressive sampling [3] is used in order to maximize the accuracy of the learned
models. The central idea of progressive sampling is to use a sampling schedule:

$S = \{n_0, n_1, n_2, n_3, \ldots, n_k\}$    (1)

where each $n_i$ is an integer that specifies the size of a sample to be provided to a training
algorithm at iteration i. Here, $n_i$ is defined as:

$n_i = n_0 \cdot a^i$    (2)

where a is a constant which defines how fast we increase the size of the sample pre- 
sented to an induction algorithm during sampling iterations. The relationship between 
sample size and model accuracy is depicted by a learning curve (Fig. 1). The horizon- 
tal axis represents n, the number of instances in a given training set, that can vary 
between zero and the maximal number of instances N. The vertical axis represents the 
accuracy of the model produced by a training algorithm when given a training set with 
n instances. Learning curves typically have a steep slope portion early in the curve, a 
more gently sloping middle part, and a plateau late in the curve. The plateau occurs 
when adding additional data instances is not likely to significantly improve prediction. 
Depending on the data, the middle part and the plateau can be missing from the learning
curve when N is small. Conversely, the plateau region can constitute most of the
curve when N is very large. In a recent study of two large business data sets, Harris-Jones
and Haines [7] found that learning curves reach a plateau quickly for some
algorithms, but small accuracy improvements continue up to N for other algorithms.
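The geometric schedule of equation (2) can be sketched in a few lines of Python. This is only an illustration: the function name and the choice to cap the last sample at N are assumptions, not part of the original method description.

```python
def geometric_schedule(n0, a, n_total):
    """Return sample sizes n_i = n0 * a**i, with the last entry capped at n_total.

    n0, a and n_total play the roles of n_0, a and N in the text; capping the
    final sample at N is an assumption made for this sketch.
    """
    if n0 < 1 or a <= 1 or n_total < n0:
        raise ValueError("need n0 >= 1, a > 1 and n_total >= n0")
    sizes = []
    i = 0
    while True:
        n_i = int(n0 * a ** i)
        if n_i >= n_total:
            sizes.append(n_total)  # never ask for more data than exists
            return sizes
        sizes.append(n_i)
        i += 1


schedule = geometric_schedule(n0=100, a=2, n_total=1000)
```

With a = 2 each sample is twice the previous one, so only about log2(N / n_0) iterations are needed to reach the full data set.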




Fig. 1. Learning curve 



Progressive sampling [3] was designed to speed up inductive learning by providing
roughly the same accuracy while using significantly smaller data sets
than those available. We used this idea to further speed up inductive learning for
very large databases and also to attempt to improve the total prediction accuracy.



3 Progressive Boosting 

The proposed progressive boosting algorithm is based on integrating the AdaBoost.M2
procedure [4] into the standard progressive sampling technique described in
Section 2. The AdaBoost.M2 algorithm proceeds in a series of T rounds. In each
round t, a weak learning algorithm is called and presented with a different distribution
$D_t$ that is altered by emphasizing particular training examples. The distribution is
updated to give wrong classifications higher weights than correct classifications. The
entire weighted training set is given to the weak learner to compute the weak hypothesis
$h_t$. At the end, all weak hypotheses are combined into a single hypothesis $h_{fn}$.

Instead of sampling the same number of data points at each boosting iteration t, our
progressive boosting algorithm (Fig. 2) draws $n_t = n_0 \cdot a^{t-1}$ data points according to
the sampling schedule S (equation 1). Therefore, we start with a small sample containing
$n_0$ data points, and in each subsequent boosting round we increase the size of the
sample used for learning a weak classifier $L_t$. Each weak classifier produces a weak
hypothesis $h_t$. At the end of each boosting round t, all weak hypotheses are combined
into a single hypothesis $H_t$. However, the distribution for drawing data samples in
subsequent sampling iterations is still updated according to the performance of the single
classifier constructed in the current sampling iteration.

• Given: Set $S = \{(x_1, y_1), \ldots, (x_N, y_N)\}$, $x_i \in X$, with labels $y_i \in Y = \{1, \ldots, C\}$

• Let $B = \{(i, y) : i = 1, \ldots, N,\; y \neq y_i\}$. Let $t = 0$.

• Initialize the distribution $D_1$ over the examples, such that $D_1(i) = 1/N$.

• REPEAT

1. $t = t + 1$

2. Draw a sample $Q_t$ that contains $n_0 \cdot a^{t-1}$ data instances according to the
distribution $D_t$.

3. Train a weak learner $L_t$ using distribution $D_t$.

4. Compute the pseudo-loss of hypothesis $h_t$:

$\varepsilon_t = \frac{1}{2} \sum_{(i,y) \in B} D_t(i, y) \left(1 - h_t(x_i, y_i) + h_t(x_i, y)\right)$

5. Set $\beta_t = \varepsilon_t / (1 - \varepsilon_t)$ and $w_t = (1/2)\left(1 - h_t(x_i, y) + h_t(x_i, y_i)\right)$

6. Update $D_t$: $D_{t+1}(i, y) = (D_t(i, y) / Z_t) \cdot \beta_t^{w_t}$

where $Z_t$ is a normalization constant chosen such that $D_{t+1}$ is a distribution.

7. Combine all weak hypotheses into a single hypothesis:

$H_t = \arg\max_{y \in Y} \sum_{j=1}^{t} \left(\log \frac{1}{\beta_j}\right) \cdot h_j(x, y)$

• UNTIL (accuracy of $H_t$ is not significantly larger than accuracy of $H_{t-1}$)

8. - Sort the classifiers from the ensemble according to their accuracy.

- REPEAT removing classifiers with accuracy less than a prespecified threshold
UNTIL there is no longer improvement in prediction accuracy

Fig. 2. The progressive boosting algorithm for data reduction 
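Steps 4 to 6 of Fig. 2 can be sketched as a single update function. The sketch below follows the standard AdaBoost.M2 pseudo-loss update; the function name, the nested-list data layout and the requirement that the mislabel distribution is zero on correct labels are assumptions made for illustration, and the code assumes 0 < pseudo-loss < 1.

```python
def m2_round(D, h, y):
    """One AdaBoost.M2-style update (steps 4-6 of the boosting loop).

    D[i][k]: weight of mislabel k for example i (assumed 0 where k == y[i]).
    h[i][k]: plausibility in [0, 1] that example i has label k.
    y[i]:    true label index of example i.
    Returns (pseudo_loss, beta, next_distribution).
    """
    n, c = len(y), len(h[0])
    # Pseudo-loss: sum over mislabel pairs (i, k) with k != y[i].
    eps = 0.5 * sum(
        D[i][k] * (1.0 - h[i][y[i]] + h[i][k])
        for i in range(n) for k in range(c) if k != y[i]
    )
    beta = eps / (1.0 - eps)
    # Down-weight mislabels the hypothesis already handles well.
    new = [
        [0.0 if k == y[i]
         else D[i][k] * beta ** (0.5 * (1.0 + h[i][y[i]] - h[i][k]))
         for k in range(c)]
        for i in range(n)
    ]
    z = sum(sum(row) for row in new)  # normalization constant Z_t
    return eps, beta, [[v / z for v in row] for row in new]


# Two examples, two classes; the second example is the harder one.
eps, beta, D_next = m2_round(
    D=[[0.0, 0.5], [0.5, 0.0]],
    h=[[0.8, 0.2], [0.4, 0.6]],
    y=[0, 1],
)
```

In this toy call the second example is classified with less margin, so its mislabel weight grows relative to the first, which is the mechanism that steers later samples toward difficult examples.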

We always stop the progressive sampling procedure when the accuracy of the hypothesis
$H_t$, obtained in the t-th sampling iteration, lies in the 95% confidence interval of
the prediction accuracy of hypothesis $H_{t-1}$ achieved in the (t-1)-th sampling iteration:

$acc(H_t) \in \left[ acc(H_{t-1}),\; acc(H_{t-1}) + 1.645 \cdot \sqrt{\frac{acc(H_{t-1}) \cdot (1 - acc(H_{t-1}))}{N}} \right]$    (3)









where $acc(H_j)$ represents the classification accuracy achieved by hypothesis $H_j$,
constructed in the j-th sampling iteration, on the entire training set.
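The stopping test can be written as a small helper. This is a sketch under the assumption that the interval half-width uses the usual binomial standard error of the previous accuracy; the function and argument names are illustrative.

```python
import math


def should_stop(acc_prev, acc_curr, n_examples, z=1.645):
    """Return True when acc_curr lies in [acc_prev, acc_prev + z * SE].

    SE is assumed to be the binomial standard error of acc_prev estimated on
    n_examples examples; z = 1.645 is the one-sided 95% normal quantile.
    """
    half_width = z * math.sqrt(acc_prev * (1.0 - acc_prev) / n_examples)
    return acc_prev <= acc_curr <= acc_prev + half_width


# An accuracy gain of 0.0005 on 10,000 examples is not significant, so stop.
stop_small_gain = should_stop(0.90, 0.9005, 10_000)
# A gain of 0.02 is well outside the interval, so keep sampling.
stop_large_gain = should_stop(0.90, 0.92, 10_000)
```

Because the interval shrinks as N grows, very large training sets make the procedure continue for smaller accuracy gains.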

It is well known in machine learning theory that an ensemble of classifiers must be
both diverse and accurate in order to improve the overall prediction accuracy. Diversity
of classifiers is achieved by learning classifiers on different data sets obtained
through weighted sampling in each sampling iteration. Nevertheless, some of the classifiers
constructed in early sampling iterations may not be accurate enough due to
an insufficient number of data examples used for learning. Therefore, before combining
the classifiers constructed in sampling iterations, we prune the classifier ensemble by
removing all classifiers whose accuracy on a validation set is less than some prespecified
threshold until the accuracy of the ensemble no longer improves. The validation set
is determined before starting the sampling procedure as a 30% sample of the entire
training data set. Assuming that the entire training set is much larger than the reduced
data set used for learning, our choice of the validation set should not introduce any
significant unfair bias, since only a small fraction of data points from the reduced
data set are included in the validation set. When the reduced data set is not significantly
smaller than the entire training set, separate unseen test and validation sets
are used for estimating the accuracy of the proposed methods.
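The pruning step described above might look as follows. This is a hedged sketch: the representation of an ensemble member as a (weight, predict) pair, the weighted-vote combiner and both function names are assumptions chosen for illustration, not the authors' implementation.

```python
def ensemble_accuracy(members, X, y):
    """Accuracy of a weighted vote; each member is (weight, predict)."""
    correct = 0
    for x, label in zip(X, y):
        votes = {}
        for weight, predict in members:
            p = predict(x)
            votes[p] = votes.get(p, 0.0) + weight
        correct += max(votes, key=votes.get) == label
    return correct / len(y)


def prune(members, X_val, y_val, threshold):
    """Drop the least accurate members below `threshold` while that helps."""
    # Most accurate member first, least accurate last.
    kept = sorted(
        members,
        key=lambda m: ensemble_accuracy([m], X_val, y_val),
        reverse=True,
    )
    best = ensemble_accuracy(kept, X_val, y_val)
    while len(kept) > 1:
        weakest = kept[-1]
        if ensemble_accuracy([weakest], X_val, y_val) >= threshold:
            break  # every remaining member is accurate enough
        candidate = kept[:-1]
        acc = ensemble_accuracy(candidate, X_val, y_val)
        if acc <= best:
            break  # removal no longer improves the ensemble
        kept, best = candidate, acc
    return kept


# Toy validation set: label is the parity of x.
X_val, y_val = [0, 1, 2, 3], [0, 1, 0, 1]
good = (1.0, lambda x: x % 2)   # always right
bad = (3.0, lambda x: 0)        # heavy weight, right only half the time
pruned = prune([bad, good], X_val, y_val, threshold=0.6)
```

In the toy call the heavily weighted but inaccurate member dominates the vote, so removing it raises validation accuracy from 0.5 to 1.0.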

Since our goal is to identify a non-redundant representative subset, the usual way of
drawing samples with replacement used in the AdaBoost.M2 procedure cannot be
employed here. Instead, remainder stochastic sampling without replacement [8]
is used, where a data example cannot be sampled more than once. As a result, the
representative subset is a set of distinct data examples with no duplicates.
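To illustrate what drawing distinct examples according to a distribution means in practice, the sketch below uses plain sequential weighted draws without replacement. Note that this is a simpler stand-in and not the remainder stochastic sampling scheme of [8]; the function name is hypothetical, and the weights of the examples still available are assumed not to be all zero.

```python
import random


def weighted_sample_without_replacement(weights, k, rng=None):
    """Draw k distinct indices, each draw proportional to the remaining weights."""
    if k > len(weights):
        raise ValueError("cannot draw more distinct examples than exist")
    rng = rng or random.Random()
    remaining = list(range(len(weights)))
    w = list(weights)
    chosen = []
    for _ in range(k):
        pos = rng.choices(range(len(remaining)), weights=w, k=1)[0]
        chosen.append(remaining.pop(pos))  # the example leaves the pool
        w.pop(pos)
    return chosen


sample = weighted_sample_without_replacement(
    [0.05, 0.05, 0.3, 0.1, 0.1, 0.1, 0.1, 0.1, 0.05, 0.05], k=5,
    rng=random.Random(0),
)
```

Each draw costs time linear in the pool size, so this form is only suitable for illustration; a production version would use a faster selection structure.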



4 Spatial Progressive Boosting 

Spatial data represent a collection of attributes whose dependence is strongly related
to spatial location: observations close to each other are more likely to be similar
than observations widely separated in space. Explanatory attributes, as well as the
target attribute, in spatial data sets are very often highly spatially correlated. Data
redundancy in spatial databases may therefore arise partly for different reasons than
in non-spatial data sets, and the standard sampling procedures may not be
appropriate for spatial data sets.

In the most common geographic information science (GIS) applications the fixed-length
grid is regular, and therefore the standard method to determine the degree of
correlation between neighboring points in such spatial data is to construct a correlogram
[9]. The correlogram is a plot of the autocorrelation coefficient computed
as a function of the separation distance between spatial data instances (Fig. 3). One
of the main characteristics of the spatial correlogram is its range, which corresponds to
the distance at which spatial dependency starts to disappear, e.g. where the absolute value
of the correlogram drops to around 0.1.
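A correlogram on a regular grid and its range can be estimated roughly as follows. This is a minimal sketch: it pools only horizontal and vertical pairs, measures lags in grid cells, and assumes a non-constant attribute; the function names and the 0.1 cutoff parameter are illustrative.

```python
def correlogram(grid, max_lag):
    """Autocorrelation of a gridded attribute at lags 1..max_lag (in cells)."""
    rows, cols = len(grid), len(grid[0])
    values = [v for row in grid for v in row]
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    coeffs = []
    for lag in range(1, max_lag + 1):
        total, count = 0.0, 0
        for i in range(rows):
            for j in range(cols):
                if j + lag < cols:  # horizontal neighbor at this lag
                    total += (grid[i][j] - mean) * (grid[i][j + lag] - mean)
                    count += 1
                if i + lag < rows:  # vertical neighbor at this lag
                    total += (grid[i][j] - mean) * (grid[i + lag][j] - mean)
                    count += 1
        coeffs.append(total / count / var)
    return coeffs


def correlogram_range(coeffs, cell_size, cutoff=0.1):
    """Smallest distance at which |autocorrelation| falls to the cutoff."""
    for lag, r in enumerate(coeffs, start=1):
        if abs(r) <= cutoff:
            return lag * cell_size
    return None  # dependency never fades within the examined lags


# Hypothetical coefficients on a 10 m grid: the range is reached at lag 4.
range_m = correlogram_range([0.8, 0.5, 0.3, 0.08], cell_size=10)
```

The estimated range can then be converted into the number of neighboring cells to exclude around each sampled point.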

Our spatial sampling procedure is a modification of the proposed progressive
boosting technique described in Section 3. The general algorithm for progressive
boosting, presented in Fig. 2, remains the same, but the procedure for sampling the
data examples in subsequent sampling iterations according to the given distribution is
adapted to spatial data. In standard sampling without replacement [8],
once a data example is sampled, it cannot be sampled again. In our spatial
modification of the sampling procedure, when a data instance (marked in Fig. 4) is
drawn, not only that instance but also all its neighboring points (the surrounding
squares of points in Fig. 4) are excluded from further sampling. How many neighbors
are excluded depends on the degree of correlation and also on the number of data
points that must be drawn in the current sampling iteration. If the number of points
to be sampled exceeds the number of data examples available for sampling,
the farthest square of points (see Fig. 4) is returned to the
set of examples available for sampling. This allows a more uniform sampling across
the spatial data set, while still concentrating on examples that are more difficult to learn.

Spatial progressive boosting employs the same algorithm as the one shown in
Fig. 2, but uses our modified spatial sampling procedure.
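The neighbor-exclusion idea can be sketched on a regular grid as follows. The data layout (a dictionary from grid cell to sampling weight), the use of Chebyshev distance to model the concentric squares of neighbors, and the rule of shrinking the exclusion radius by one square when no candidates remain are assumptions made for this illustration; weights are assumed positive.

```python
import random


def chebyshev(p, q):
    """Distance in 'squares' between two grid cells."""
    return max(abs(p[0] - q[0]), abs(p[1] - q[1]))


def spatial_sample(weights, k, radius, rng=None):
    """Draw k distinct cells; cells within `radius` squares of a drawn cell
    are excluded, and the radius shrinks when candidates run out."""
    rng = rng or random.Random()
    chosen = []
    while len(chosen) < k:
        available = [
            p for p in weights
            if all(chebyshev(p, q) > radius for q in chosen)
        ]
        if not available:
            if radius == 0:
                raise ValueError("k exceeds the number of grid cells")
            radius -= 1  # re-admit the farthest excluded square
            continue
        pick = rng.choices(
            available, weights=[weights[p] for p in available], k=1
        )[0]
        chosen.append(pick)
    return chosen


# A 1 x 4 strip of cells with equal weights: two picks can never be adjacent.
strip = {(0, c): 1.0 for c in range(4)}
first, second = spatial_sample(strip, k=2, radius=1, rng=random.Random(0))
```

Recomputing the candidate list on every draw keeps the sketch short but is quadratic in practice; an implementation for large grids would maintain the excluded set incrementally.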




Fig. 3. A spatial correlogram with a 40 m range 



Fig. 4. The scheme for sampling data examples in spatial data sets



5 Experimental Results 

An important issue in progressive sampling based techniques is the type of model
used for training through iterations. We used non-linear 2-layer feedforward neural
network (NN) models, which generally have a large variance, meaning that their accuracy
can differ considerably depending on the initial weights and the choice of training data.
In such situations, using the progressive sampling procedure may result in significant
errors in the estimation of the minimal sample size. In order to alleviate the effect of
neural network instability in our experiments, the prediction accuracy is averaged over
20 trials of the proposed algorithms, i.e. the sampling procedures are repeated 20 times
and the accuracies achieved at the same sampling round for all 20 trials are averaged. Since it is
unlikely that the progressive sampling technique always stops at the same sampling
iteration in each of these trials, we simply determined the number of sampling iterations
in the first trial of the progressive sampling technique, and all other trials for all
sampling variants were repeated for that number of sampling iterations. To
investigate the real generalization properties of the built NN models, we tested our
classification models on the entire training set and on unseen data with a similar distribution.

The number of hidden neurons in our NN models was equal to the number of input 
attributes. The NN classification models had the number of output nodes equal to the 
number of classes (3 in our experiments), where the class was predicted according to 
the output with the largest response. Resilient propagation (RP) [10] and Levenberg-Marquardt
(LM) [11] algorithms were used for learning; better prediction
accuracies were achieved using the LM learning algorithm, and only those results are
reported here. The LM algorithm is a variant of Newton’s method, where the approximation
of the Hessian matrix of mixed partial derivatives is obtained by averaging
outer products of estimated gradients. It is very well suited for training small to
medium-size NNs through mean squared error minimization.
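For reference, the Hessian approximation mentioned above is usually written as follows in the textbook form of the Levenberg-Marquardt update. The symbols are not taken from this paper: $\mathbf{e}$ denotes the vector of per-example output errors, $\mathbf{J}$ its Jacobian with respect to the weights $\mathbf{w}$, and $\mu$ a damping parameter.

```latex
\mathbf{H} \approx \mathbf{J}^{\top}\mathbf{J}
  = \sum_{k} \nabla e_k \, (\nabla e_k)^{\top},
\qquad
\Delta \mathbf{w}
  = -\left(\mathbf{J}^{\top}\mathbf{J} + \mu \mathbf{I}\right)^{-1}
    \mathbf{J}^{\top}\mathbf{e}
```

The matrix to be inverted has one row and column per network weight, which is why the method is practical only for small to medium-size networks.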

We performed our experiments on several large data sets. The first data set was
generated using our spatial data simulator [12] such that the distributions of generated
data resembled the distributions of real-life spatial data. A square-shaped spatial area
of size 5120 meters x 5120 meters sampled on a relatively dense spatial grid
(10 meters x 10 meters) resulted in 262,144 (512^2) training instances. The obtained
spatial data stemmed from a homogeneous distribution and had five continuous attributes
and three equal-size classes.

The second data set was Covertype data, currently one of the largest databases in
the UCI Database Repository [13]. This spatial data set contains 581,012 examples
with 54 attributes and 7 target classes and represents the forest cover type for 30 x 30
meter cells obtained from the US Forest Service (USFS) Region 2 Resource Information
System [14]. In the Covertype data set, 40 attributes are binary columns representing soil
type, 4 attributes are binary columns representing wilderness area, and the remaining
10 are continuous topographical attributes. Since training a neural network classifier
would be very slow if all 40 attributes representing the soil type variable were used, we
transformed them into 7 new ordered attributes. These 7 attributes were determined by
computing the relative frequencies of each of the 7 classes in each of the 40 soil types. Therefore,
instead of using a single value for representing each soil type, we used a 7-dimensional
vector with values that could be considered continuous and therefore
more appropriate for use with neural networks. This resulted in a transformed data
set with 21 attributes.
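The soil-type transformation amounts to replacing each category by the vector of class relative frequencies observed within it. A minimal sketch, with a hypothetical function name and toy data:

```python
from collections import Counter, defaultdict


def class_frequency_encoding(categories, labels, classes):
    """Map each category to the relative frequency of every class within it."""
    counts = defaultdict(Counter)
    for cat, lab in zip(categories, labels):
        counts[cat][lab] += 1
    encoding = {}
    for cat, counter in counts.items():
        total = sum(counter.values())
        encoding[cat] = [counter[c] / total for c in classes]
    return encoding


# Toy example with two soil types and two classes.
encoding = class_frequency_encoding(
    categories=["soil_a", "soil_a", "soil_b", "soil_a"],
    labels=[1, 2, 1, 1],
    classes=[1, 2],
)
```

For Covertype this would turn the 40 binary soil columns into one 7-dimensional vector per example, giving 10 + 4 + 7 = 21 attributes.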

The experiments were also performed on Waveform and LED data sets from the 
UCI repository [13]. For the Waveform set, 100,000 instances with 21 continuous 
attributes and three equally sized classes were generated, while for the LED data set 
50,000 examples were generated for training and 50,000 examples were generated for 
testing. Both training and test data sets had 7 binary attributes and 10 classes. 

We first performed progressive sampling on all data sets, where in the schedule
given in equation (2) we used a = 2. Therefore, randomly chosen data samples in
subsequent sampling iterations were always twice as large as the samples drawn in the
previous iteration. Since in our sampling procedures all classifiers constructed in all
previous sampling iterations are saved and combined with the classifier from the current
iteration, we also used a progressive bagging scheme, where the classifiers
constructed on randomly selected, progressively larger data samples were combined
into an ensemble using the same combining weights. Finally, we performed our
proposed progressive boosting technique for data reduction on all sets. The improvement
of classification accuracy during the sampling iterations on all considered data
sets is shown in Fig. 5.

In order to better compare our proposed sampling techniques with progressive
sampling, we stopped them in the same sampling iteration in which we stopped progressive
sampling. In this way, we are able to examine two effects of data reduction techniques.
First, we can observe the possible improvements in classification
accuracy when the same size of data sample, necessary for constructing a sufficiently
accurate model in progressive sampling, is used. Second, we are able to compare the
level of data reduction by evaluating the sizes of data samples for which we achieve
the same classification accuracy. The possible savings in processing time are not
reported due to lack of space, although these savings are proportional to the level of
data reduction since the time for training NN models is proportional to data set size.
All results in Fig. 5 are shown starting from the second or third sampling iteration
since all the methods achieved similar accuracies in the first few iterations.



[Figure panels: Synthetic spatial data set; Covertype data set; Waveform data set]

Fig. 5. The classification accuracy as a function of sample size for different progressive
sampling techniques on four domains

Analyzing the charts in Fig. 5, it is evident that the sampling methods involving the
proposed model integration showed improvements both in prediction accuracy and in
achieved data reduction as compared to standard progressive sampling. The improvement
in achieved final prediction accuracy was evident for the synthetic spatial,
Covertype and LED data sets, while the experiments performed on the Waveform data set
resulted in similar final prediction accuracy for all proposed variants of sampling
techniques, probably due to the high homogeneity of the data. During the sampling
(iterations 3 to 7, Fig. 5), progressive boosting consistently achieved better prediction
accuracy than progressive bagging, although this difference was fairly small.
The dominance of progressive boosting can be explained by the fact that the sampling
procedure employed in progressive boosting attempted to rank sampling data examples
from those that are more difficult to learn to those that are easier. Therefore,
all advantages of standard boosting were also integrated in our progressive boosting
technique.

It is also evident that for the same level of data reduction (the same sampling iteration,
which corresponds to training data of the same size) the achieved prediction accuracy
was significantly higher when using progressive boosting and even progressive
bagging instead of standard progressive sampling (Fig. 5). In addition, the same prediction
accuracy was achieved with much smaller data sets when using progressive
boosting and bagging for data reduction instead of relying on standard progressive
sampling. For example, the prediction accuracy on the synthetic spatial data (Fig. 5a)
that was achieved by the progressive sampling technique with 65,664 examples (10 iterations)
was also achieved by progressive boosting with 8,208 examples (7 iterations).
Hence, the gain of these three iterations was a data set about eight times smaller
for progressive boosting as compared to progressive sampling.
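The factor of eight follows directly from the geometric schedule with a = 2, since stopping three iterations earlier divides the sample size by $a^3$:

```latex
\frac{n_{t}}{n_{t-3}} = \frac{n_0 \cdot a^{t}}{n_0 \cdot a^{t-3}} = a^{3} = 2^{3} = 8,
\qquad
\frac{65{,}664}{8{,}208} = 8
```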

The level of data reduction for different sampling techniques may be compared if
we measure the minimum data sets needed for achieving the same accuracy. This
reference accuracy is the one reached by progressive sampling when no further significant
improvement in accuracy is observed. For easier comparison, the size
of the reduced data set used to obtain this accuracy by progressive sampling served as the
baseline reduction level, against which we compared the other data reduction
techniques. Table 1 shows the level of data reduction for the data sets used.



Table 1. The size of the data sets used for successful learning and their percentage of the
original data set size when different sampling techniques are employed

Method \ Data set       Synthetic Spatial   Covertype        LED             Waveform
Progressive Sampling    65,664 (25.1 %)     32,768 (5.6 %)   12,288 (25 %)   9,984 (9.9 %)
Progressive Bagging     16,416 (6.3 %)      8,192 (1.4 %)    3,072 (6.1 %)   9,984 (9.9 %)
Progressive Boosting    8,208 (3.1 %)       8,192 (1.4 %)    1,536 (3.1 %)   9,984 (9.9 %)



It is evident from Table 1 that both sampling methods with model integration
achieved better reduction performance than standard progressive sampling. With the
model integration methods, the reduced data set was four to eight times smaller than
the reduced data set identified through standard progressive sampling. The only exception
was the Waveform data set (Table 1), where no additional reduction
was achieved by combining different classifiers, again due to the high homogeneity
of the data. Nevertheless, when employing progressive boosting and progressive
bagging techniques, there is an additional requirement to store all the previously constructed
classifiers, or to save all data sets used for constructing these classifiers. Usually,
storing only the constructed classifiers is beneficial when employing an ensemble
to make predictions on an unseen data set with a similar distribution. However, very
often there is a need to store all data examples needed for constructing
all the classifiers. Since we use geometric progressive sampling, where the data sample
in each sampling iteration is twice as large as the sample used at the previous
iteration, the total size of all previous data samples cannot be larger than the size
of the data sample used in the current sampling iteration. Therefore, even in this case,
according to Table 1 we can still achieve a better level of data reduction than standard
progressive sampling.
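The bound on the total size of the earlier samples is the usual geometric sum. Using the schedule of equation (2) with a = 2:

```latex
\sum_{i=0}^{t-1} n_i = \sum_{i=0}^{t-1} n_0 \cdot 2^{i} = n_0 \left(2^{t} - 1\right) < n_0 \cdot 2^{t} = n_t
```

Storing every sample used so far therefore requires less than twice the size of the current sample, which is still well below the four to eight times larger data set needed by standard progressive sampling on three of the four domains in Table 1.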

We also performed experiments with pruning inaccurate classifiers constructed in
progressive boosting iterations (Fig. 6). For the geometric sampling schedule we again
used a = 2. When pruning inaccurate classifiers, we always eliminated those classifiers
that harmed the overall classification accuracy on the validation set. Again, the accuracies
on the entire training set are shown starting from the second iteration, since there
was no pruning at the first iteration (Fig. 6).



[Figure panels: Synthetic spatial data set; Covertype data set]

Fig. 6. The classification accuracy during the sampling iterations of progressive boosting and
pruning progressive boosting



Results from the experiments presented in Fig. 6 indicate that pruning progressive
boosting outperformed the progressive boosting technique both in achieved accuracy
and in the level of data reduction for the synthetic spatial and Covertype data sets. The
enhancements of pruning progressive boosting on the Waveform and LED data sets were
insignificant as compared to the progressive boosting technique, and therefore these
results are not presented here. It is evident from Fig. 6 that for the synthetic spatial and
Covertype data sets the same prediction accuracy may be achieved much faster when
pruning classifiers than without pruning. For example, an accuracy of 92% for the synthetic
spatial data set was achieved by progressive boosting without pruning with 65,664
examples (iteration 11), while similar accuracy was achieved by pruning progressive
boosting with 8,208 examples (iteration 8), thus resulting in an eight times smaller
data set. The same result can be observed for the Covertype data set, where pruning
progressive boosting again required an eight times smaller data set for comparable
prediction accuracy.

Finally, we performed experiments on sampling spatial data using our proposed
technique for spatial progressive boosting. Since the positions of data examples in
the form of x and y coordinates were available only for the synthetic spatial
data set, but not for the Covertype data set, the results are reported only for the synthetic
spatial data (Fig. 7). The accuracy is shown starting from the third sampling iteration due
to the similar performance of spatial and non-spatial sampling in the first two iterations.




Fig. 7. The classification accuracy during the sampling iterations of spatial progressive
boosting and standard progressive boosting on the synthetic spatial data set



Fig. 7 shows that the spatial progressive boosting method, starting from the fourth
iteration, outperformed regular progressive boosting in achieved prediction accuracy.
In addition, to achieve an accuracy of 92%, spatial progressive boosting needed
a four times smaller data set than regular progressive boosting. One possible reason
for such a successful reduction of this data set is the high spatial correlation
among its observed attributes and its relatively dense spatial grid (10 x 10 meters).



6 Conclusions 

Several new sampling procedures based on the progressive sampling idea are pro- 
posed. They are intended for an efficient reduction of very large and possibly spatial 
databases. Experimental results on several data sets indicate that the proposed sam- 
pling techniques can effectively achieve similar or even better prediction accuracy 
while obtaining a better data reduction than the standard progressive sampling tech- 
nique. Depending on the data set, accuracy comparable to relying on the whole data 
set was achieved using 1.4% to 6.1% of the original data. 

The question that naturally arises from this paper is the possible gain when comparing
the proposed sampling techniques with the procedure of first performing
progressive sampling and then applying some of the methods for combining classifiers
(bagging, boosting). First, our sampling techniques are faster since they do not require
an additional algorithm for combining classifiers. Second, our algorithms provide a better
diversity of combined classifiers, since during the sampling iterations some of the
instances difficult for learning were naturally included in the reduced data set by our
algorithms, while these may not be included in the final data set when performing standard
progressive sampling. Finally, when using our algorithms, only a small number of
data examples that are relatively easy for learning will be included in the reduced data
set, unlike progressive sampling where this number cannot be controlled. Our
future work will address the significance of the difference between these two methods.

One of the possible drawbacks of our proposed sampling techniques, which will also
be carefully investigated in our future work, is the increased time required for controlled
sampling as compared to random sampling. For reduction of heterogeneous
data sets we are currently experimenting with radial basis functions, while for spatial
data reduction different similarity information will be explored. In addition, we are
also extending the proposed methods to regression-based problems.



Abstract. In essence, data mining consists of extracting knowledge from data.
This paper proposes a co-evolutionary system for discovering fuzzy classification
rules. The system uses two evolutionary algorithms: a genetic programming
(GP) algorithm evolving a population of fuzzy rule sets and a simple evolutionary
algorithm evolving a population of membership function definitions. The
two populations co-evolve, so that the final result of the co-evolutionary process
is a fuzzy rule set and a set of membership function definitions which are
well adapted to each other. In addition, our system also has some innovative
ideas with respect to the encoding of GP individuals representing rule sets. The
basic idea is that our individual encoding scheme incorporates several syntactical
restrictions that facilitate the handling of rule sets in disjunctive normal
form. We have also adapted GP operators to better work with the proposed
individual encoding scheme.



1 Introduction 

In the context of machine learning and data mining, one popular way of expressing 
knowledge consists of IF-THEN rules. This is due to the fact that they are intuitively 
comprehensible to a human being [5]. In addition, they represent independent units of 
knowledge, so that alterations can easily take place in their contents. IF-THEN rules 
are composed of two parts. The first part (IF component, or rule antecedent) corre- 
sponds to a conjunction of conditions that, if verified true, imply that the condition 
contained in the second part (THEN component, or rule consequent) is also consid- 
ered true. 

Rules in their classic format are appropriate when their conditions involve
discrete or categorical variables. However, the presence of continuous variables
creates situations that defy common sense. Consider the rule: “IF age < 25
THEN safe_driver = no”. The problem here is the sudden and unnatural transition
between categories: an individual can be classified as not being a safe driver today
but, on the following day, having turned 25, be classified as
a safe driver. This could lead a data mining system to completely different predictions
within the interval of a single day. One promising way to work with continuous
variables and to overcome this inconvenience is the use of fuzzy logic. Besides
expressing knowledge in a more natural way, fuzzy logic is also a flexible and
powerful method for uncertainty management [13], [6].
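The difference between the crisp condition and a fuzzy one can be made concrete with a simple membership function. The sketch below is purely illustrative: the linguistic value "young", the piecewise-linear shape and the breakpoints at 20 and 30 years are assumptions, not the membership functions evolved by the system described in this paper.

```python
def crisp_young(age):
    """Crisp condition 'age < 25': membership jumps from 1 to 0 at 25."""
    return 1.0 if age < 25 else 0.0


def fuzzy_young(age, full_until=20.0, zero_from=30.0):
    """Degree to which `age` is 'young', falling linearly from 1 to 0.

    The breakpoints full_until and zero_from are hypothetical values.
    """
    if age <= full_until:
        return 1.0
    if age >= zero_from:
        return 0.0
    return (zero_from - age) / (zero_from - full_until)


# One day before and one day after the 25th birthday.
day = 1.0 / 365.0
crisp_jump = crisp_young(25 - day) - crisp_young(25 + day)
fuzzy_jump = fuzzy_young(25 - day) - fuzzy_young(25 + day)
```

The crisp rule changes its answer completely across the birthday, whereas the fuzzy degree of membership changes by well under one percent over the same two days.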

In the literature several techniques have been used for discovery of fuzzy IF-THEN
rules. Several recent projects have proposed the use of evolutionary algorithms for
fuzzy rule discovery [2], [10], [11], [19], [21], [17], because they allow a global search
in the state space, increasing the probability of converging to the globally optimal
solution.

The main characteristic of our proposed system that makes it different from the
above systems is that our system is based on the co-evolution of fuzzy rule sets and
membership function definitions, using two separate populations, whereas in general
the above projects are based on the evolution of a single population. The population of
fuzzy rule sets is evolved by a Genetic Programming (GP) algorithm, whereas the
population of membership function definitions is evolved by a simple evolutionary
algorithm.

In addition, our system also has some innovative ideas with respect to the encoding 
of GP individuals representing rule sets. The basic idea is that our individual encoding 
scheme incorporates several syntactical restrictions that facilitate the handling of rule 
sets in disjunctive normal form. We have also adapted GP operators to better work 
with the proposed individual encoding scheme. 

The remainder of this paper is organized as follows. Section 2 describes in detail 
our proposed co-evolutionary system. Section 3 discusses computational results. Fi- 
nally, section 4 concludes the paper. 



2 The Proposed Co-evolutionary System for Fuzzy Rule Discovery 

2.1 An Overview of the System 

This section presents an overview of our CEFR-MINER (Co-Evolutionary Fuzzy Rule 
Miner) system. CEFR-MINER is a system developed for the classification task of data 
mining. It consists of two co-evolving evolutionary algorithms. The first one is a Ge- 
netic Programming (GP) algorithm where each individual represents a fuzzy rule set. 
A GP individual specifies only the attribute-value pairs composing the rule conditions 
of that individual’s rule set. The definitions of the membership functions necessary to 
interpret the fuzzy rule conditions of an individual are provided by the second popula- 
tion. The second algorithm is a simple evolutionary algorithm, which works with a 
“population” of a single individual. This population evolves via the principle of natu- 
ral selection and application of mutation, but not crossover. This single individual 
specifies definitions of all the membership functions for all attributes being fuzzified 
(all originally continuous attributes). These definitions are used by the first population 
of GP individuals, as mentioned above. Note that categorical attributes are not fuzzi- 
fied - their values are handled only by the GP population. 
As a result, the system simultaneously evolves both fuzzy rule sets and membership
function definitions specifically suited for the fuzzy rule sets. The main advantage of 
this co-evolutionary approach is that the fitness of a given set of membership function 
definitions is evaluated across several fuzzy rule sets, encoded into several different 
GP individuals, rather than on a single fuzzy set. This improves the robustness of that 
evaluation. 

This basic idea of co-evolution for fuzzy-rule discovery has been recently proposed
by Delgado et al. [4]. The main differences between that work and our system are as follows: (a)
Delgado et al.’s work uses three co-evolving populations and our work uses only two; 
(b) Delgado et al.’s work uses genetic algorithms for evolving two of its three popula- 
tions. By contrast, we use genetic programming to evolve the rule set population; (c) 
our work addresses the classification task of data mining, whereas Delgado et al.’s 
work addresses the problem of numeric function approximation. 



2.2 The Genetic Programming Population 
2.2.1 Rule Representation 

Our system follows the Pittsburgh approach [7] and thus each individual represents a 
set of rules. Each rule has the form: IF conditions THEN prediction. The prediction of
the rule has the form: “goal attribute = class”, where class is one of the values that can 
be taken on by the goal attribute. In each run of the system all individuals of the GP 
population are associated with the same prediction. Therefore, there is no need to 
explicitly encode this prediction into the genome of an individual. Since each run 
discovers rules predicting a single class, the system must be run c times, where c is the 
number of classes. Although this approach increases processing time, it has two im- 
portant advantages: (a) it simplifies individual encoding; and (b) it avoids the problem 
of mating between individuals that predict different classes, which could produce low- 
quality offspring. Each individual actually corresponds to a set of rule antecedents 
encoded in disjunctive normal form (DNF), such as: (sore throat = true AND age = 
low) OR (headache = true AND NOT temperature = low). 

In our system the function set contains the logical operators {AND, OR, NOT}. 
Since each individual represents fuzzy rules, a fuzzy version of these logical operators 
must be used. We have used the standard fuzzy AND (intersection), OR (union) and 
NOT (complement) operators [13]. More precisely, let $\mu_A(x)$ denote the membership
degree of an element x in the fuzzy set A, i.e. the degree to which x belongs to the
fuzzy set A. The standard AND of two fuzzy sets A and B, denoted A AND B, is defined
as $\mu_{A\,\mathrm{AND}\,B}(x) = \min[\mu_A(x), \mu_B(x)]$, where min denotes the minimum operator.
The standard OR of two fuzzy sets A and B, denoted A OR B, is defined as
$\mu_{A\,\mathrm{OR}\,B}(x) = \max[\mu_A(x), \mu_B(x)]$, where max denotes the maximum operator. The
standard NOT of a fuzzy set A, denoted NOT A, is defined as $\mu_{\mathrm{NOT}\,A}(x) = 1 - \mu_A(x)$.
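
As an illustration, the three standard operators can be sketched in a few lines of Python. The function names and the example membership degrees are illustrative assumptions, not part of CEFR-MINER.

```python
def fuzzy_and(*degrees: float) -> float:
    """Standard fuzzy intersection: the minimum of the membership degrees."""
    return min(degrees)


def fuzzy_or(*degrees: float) -> float:
    """Standard fuzzy union: the maximum of the membership degrees."""
    return max(degrees)


def fuzzy_not(degree: float) -> float:
    """Standard fuzzy complement: one minus the membership degree."""
    return 1.0 - degree


# Antecedent (headache = true AND NOT temperature = low) for a hypothetical
# example whose degrees are 1.0 for "headache = true" and 0.3 for
# "temperature = low".
antecedent_degree = fuzzy_and(1.0, fuzzy_not(0.3))
```

The operators accept any number of degrees, which is convenient when a rule antecedent has more than two conditions.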

The terminal set consists of all possible conditions of the form: “Attr_i = Val_ij”,
where Attr_i is the i-th attribute of the dataset. If attribute Attr_i is categorical, Val_ij is the
j-th value of the domain of Attr_i. If attribute Attr_i is continuous - which means it is
being fuzzified by the system - Val_ij is a linguistic value in {low, medium, high}. We
use only three linguistic values in order to reduce the size of the search space. 




Discovering Fuzzy Classification Rules with Genetic Programming and Co-evolution 






In order to produce individual trees with only valid rule antecedents and in DNF we 
propose some syntactic restrictions in the tree representation, as follows: (a) the root 
node is always an OR node; (b) with the exception of the root node, each OR node 
must have as its parent another OR node, and can have as its children any kind of 
node; (c) each AND node must have as its parent either an OR node or another AND 
node, and can have as its children AND, NOT or terminal nodes; (d) a NOT node can 
have as its parent an OR node, an AND node or a NOT node; and it can have as its 
child either another NOT node or a terminal node (we allow conditions of the form 
“NOT NOT ...” to allow the possibility of a NOT being cancelled by another NOT as 
a result of genetic operators) and (e) there cannot be two or more terminal nodes refer- 
ring to the same attribute in the same rule antecedent, since this would tend to produce 
invalid rule antecedents such as (sex = male AND sex = female). Fig. 1 shows an 
individual with five rule antecedents. 

These syntactic constraints are enforced both when creating individuals of the ini- 
tial population and when modifying individuals due to the action of a genetic operator. 
This approach can be regarded as a kind of strongly-typed GP [16] proposed specifi- 
cally for the discovery of rule sets in disjunctive normal form, which makes it attrac- 
tive for data mining applications. 
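
The restrictions (a)-(e) can be checked mechanically. The following is a minimal sketch under an assumed tree representation: the `Node` class, the `term` helper and all function names are hypothetical and chosen only for illustration, and the check is run on a finished tree rather than incrementally during tree construction.

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Optional


@dataclass
class Node:
    kind: str                                   # "OR", "AND", "NOT" or "TERM"
    children: List["Node"] = field(default_factory=list)
    attribute: Optional[str] = None             # only used by TERM nodes
    value: Optional[str] = None                 # only used by TERM nodes


# Child kinds permitted under each parent kind, following restrictions (b)-(d).
ALLOWED_CHILDREN = {
    "OR": {"OR", "AND", "NOT", "TERM"},
    "AND": {"AND", "NOT", "TERM"},
    "NOT": {"NOT", "TERM"},
    "TERM": set(),
}


def term(attribute: str, value: str) -> Node:
    """Build a terminal node for the condition 'attribute = value'."""
    return Node("TERM", attribute=attribute, value=value)


def structure_ok(node: Node) -> bool:
    """Check the parent/child restrictions (b)-(d) for the whole subtree."""
    if node.kind not in ALLOWED_CHILDREN:
        return False
    if node.kind == "NOT" and len(node.children) != 1:
        return False
    return all(
        child.kind in ALLOWED_CHILDREN[node.kind] and structure_ok(child)
        for child in node.children
    )


def rule_roots(or_node: Node) -> Iterator[Node]:
    """Yield the root of each rule antecedent: every non-OR child of an OR node."""
    for child in or_node.children:
        if child.kind == "OR":
            yield from rule_roots(child)
        else:
            yield child


def attributes(node: Node) -> List[str]:
    """Collect the attributes of all terminal nodes below a node."""
    if node.kind == "TERM":
        return [node.attribute]
    return [a for child in node.children for a in attributes(child)]


def is_valid_individual(root: Node) -> bool:
    if root.kind != "OR" or not structure_ok(root):   # (a), then (b)-(d)
        return False
    for rule in rule_roots(root):                     # (e): no repeated attribute
        attrs = attributes(rule)
        if len(attrs) != len(set(attrs)):
            return False
    return True
```

Because OR nodes can only appear above AND, NOT and terminal nodes, each rule antecedent is simply a maximal subtree hanging below the chain of OR nodes, which is what `rule_roots` exploits.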







The main advantage of building the DNF directly into the tree representation,
rather than converting a rule set into DNF after GP has evolved, is that this makes it 
easier to fulfil the aforementioned restriction (e). Because of the hierarchical position 
of the nodes, it is easy to collect the terminal nodes of an individual rule, as shown in 
Fig. 1, in order to check whether or not a condition can be inserted into that rule. 

Another possible approach to assure that the GP will run only with syntactically 
valid individuals would be to use a context-free grammar to implement the aforemen- 
tioned syntactic restrictions. The drawback of this approach would be the difficulty in 
checking syntactic restriction (e), which would lead to an explosion of the number of 
production rules in the grammar. To avoid this, a logic grammar could be used [20], 
but this would introduce some complexity to the system. Thus, we have preferred the 
above-described direct implementation of syntactic constraints. 

2.2.2 Selection and Genetic Operators

We use the tournament selection method, with tournament size 2 and with a simple 
extension: if two individuals have the same fitness, the one with smaller complexity is 
selected. Complexity is measured by the following formula [12]:

complexity = 2 × number_of_rules + number_of_conditions . (1)

This extension was motivated by observations in our experiments: sometimes the 
two individuals competing in the tournament had the same fitness value, even though 
they were different individuals. 
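
The selection scheme can be sketched as follows. The `Individual` record and the function names are hypothetical, and drawing the two competitors without replacement is an assumption, since the sampling details are not stated above.

```python
import random
from typing import NamedTuple, Sequence


class Individual(NamedTuple):
    fitness: float
    n_rules: int
    n_conditions: int


def complexity(ind: Individual) -> int:
    # Equation (1): complexity = 2 * number_of_rules + number_of_conditions
    return 2 * ind.n_rules + ind.n_conditions


def tournament(population: Sequence[Individual], rng: random.Random) -> Individual:
    """Tournament of size 2; ties on fitness go to the less complex individual."""
    a, b = rng.sample(population, 2)
    if a.fitness != b.fitness:
        return a if a.fitness > b.fitness else b
    return a if complexity(a) <= complexity(b) else b
```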

Once two individuals are selected, crossover is performed in a similar way to conventional
GP crossover, with the difference that in our case the crossover operator
respects the above-discussed syntactic restrictions, in order to guarantee that crossover 
always generates syntactically-valid offspring. (If crossover cannot produce syntacti- 
cally-valid individuals, the crossover operation fails and no children are produced.) 

The current version of the system uses a crossover probability of 80%, a relatively 
common setting in the literature. However, in our system the offspring produced by 
crossover is not necessarily inserted into the population. Our population updating 
strategy is as follows. Once all crossovers have been performed, all the produced 
offspring are added to the population of individuals. Therefore, the population size is 
provisionally increased by 80%. Then all individuals are sorted by fitness value, and 
the worst individuals are removed from the population. The number of removed indi- 
viduals is chosen in such a way that the number of individuals left in the population is 
always a constant population size, set to 250 individuals (an empirically-determined 
setting) in our experiments. We chose this population-updating strategy mainly because
it increases selection pressure, in comparison with a conventional generational-replacement
strategy. This is analogous to the $(\mu+\lambda)$-strategy employed in the second
EA of this system, described in Section 2.3.2. The main difference is that here we use
a $(\mu+\lambda)$ strategy on top of tournament selection, whereas the classic $(\mu+\lambda)$ strategy
uses no such scheme.
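
A minimal sketch of this population-updating step is given below. The function name and signature are illustrative assumptions; `fitness` stands for whatever function scores a GP individual.

```python
from typing import Callable, List, TypeVar

T = TypeVar("T")


def update_population(
    parents: List[T],
    offspring: List[T],
    fitness: Callable[[T], float],
    size: int = 250,
) -> List[T]:
    """Merge parents and offspring, then keep only the `size` fittest individuals."""
    merged = sorted(parents + offspring, key=fitness, reverse=True)
    return merged[:size]
```

Since parents and offspring compete in the same sorted list, a poor offspring is discarded immediately, which is the source of the increased selection pressure mentioned above.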

Our system uses a mutation operator where a node is randomly chosen and then the 
subtree rooted at that node is replaced by another randomly-generated subtree. In the 
current version of the system an individual undergoes mutation with a probability of 
20% (an empirically-determined setting), with just one exception. The best individual 
of each generation never undergoes mutation, so that its fitness will never be wors- 
ened. 

2.2.3 Fitness Function 

In order to calculate the fitness of a GP individual, the first step is to compute the 
following counters: 

TP (true positives) is the number of examples that are covered by at least one of the
individual’s rules and have the class predicted by those rules;

FP (false positives) is the number of examples that are covered by at least one of
the individual’s rules but have a class different from the class predicted by those
rules;

FN (false negatives) is the number of examples that are not covered by any of the
individual’s rules but have the class predicted by those rules;

TN (true negatives) is the number of examples that are not covered by any of the
individual’s rules and do not have the class predicted by those rules.

Note that the true positives and true negatives correspond to correct predictions 
made by the individual being evaluated, whereas the false positives and the false nega- 
tives correspond to wrong predictions made by that individual. In our system the fit- 
ness of a GP individual is computed by the following formula [9]: 

(TP / (TP + FN)) × (TN / (FP + TN)) . (2)

In the data mining literature, in general it is implicitly assumed that the values of 
TP, FP, FN and TN are crisp. This very commonplace assumption is invalid in our 
case, since our system discovers fuzzy rules. In our system an example can be covered 
by a rule antecedent to a certain degree in the range [0..1], which corresponds to the 
membership degree of that example in that rule antecedent. Therefore, the system 
computes fuzzy values for TP, FP, FN and TN. 

The membership degree of record r into the rule set encoded by the individual I is 
computed as follows. For each rule of I, the system computes the membership degree 
of r into each of the conditions of that rule. Then the membership degree for the entire 
rule antecedent is computed by a fuzzy AND of the membership degrees for all the 
rule conditions. This process is repeated for all the rules of the individual I. Then the 
membership degree of the entire rule set is computed by a fuzzy OR of the member- 
ship degrees for all the rules of I. 

For instance, suppose a training example has the class predicted by the individual 
I’s rules. Ideally, we would like that example to be covered by at least one of I’s rules
to a degree of 1, so that the entire rule set of I would cover that example to a degree of
1. Suppose that I has two rules, and that the current training example is covered by
those rules to degrees of 0.6 and 0.8. Then the fuzzy OR would return a membership 
degree of 0.8 for the entire rule set. This means that the prediction made by the indi- 
vidual is 80% correct and 20% wrong. As a result, this example contributes a value of 
0.8 for the number of true positives and a value of 0.2 for the number of false nega- 
tives. 
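
The computation of the fuzzy counters and of formula (2) can be sketched as follows. All names are illustrative. The treatment of examples that do not have the predicted class (coverage degree counted as FP, its complement as TN) is an assumption made by symmetry with the worked example above, which only covers the TP/FN case.

```python
from typing import Iterable, List, Tuple


def rule_set_degree(rules: List[List[float]]) -> float:
    """Fuzzy OR (max) over rules of the fuzzy AND (min) over each rule's conditions.

    `rules` holds, for one example, the membership degrees of that example
    in every condition of every rule.
    """
    return max(min(condition_degrees) for condition_degrees in rules)


def fuzzy_counts(examples: Iterable[Tuple[float, bool]]) -> Tuple[float, float, float, float]:
    """Accumulate fuzzy TP, FP, FN and TN.

    Each example is (coverage degree by the rule set, has the predicted class?).
    """
    tp = fp = fn = tn = 0.0
    for degree, has_predicted_class in examples:
        if has_predicted_class:
            tp += degree
            fn += 1.0 - degree
        else:                      # assumption: symmetric handling of negatives
            fp += degree
            tn += 1.0 - degree
    return tp, fp, fn, tn


def fitness(tp: float, fp: float, fn: float, tn: float) -> float:
    """Formula (2): (TP / (TP + FN)) * (TN / (FP + TN))."""
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    specificity = tn / (fp + tn) if fp + tn else 0.0
    return sensitivity * specificity
```

For the example in the text, two rules covering a positive example to degrees 0.6 and 0.8 give a rule-set degree of 0.8, which adds 0.8 to TP and 0.2 to FN.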

2.2.4 Tree Pruning 

Rule pruning is important not only in data mining [3] but also in GP, due to the well- 
known effects of code bloat [14], [1]. Code bloat has greatly affected our system’s 
performance. In our initial experiments, with no pruning at all, some datasets required 
an unacceptable amount of running time. Therefore, we have designed an operator to 
prune GP trees. The basic idea of this operator is to randomly remove conditions from 
a rule with a null coverage - i.e. a rule which does not cover any record - until it cov- 
ers at least one record or until all conditions are removed, which corresponds to re- 
moving the entire rule from its rule set. 

More precisely, each rule of the individual is separated and evaluated by itself. The 
ones that have a null coverage will have some conditions dropped according to the 
following criteria: 

(a) If a rule has more than 7 conditions, some conditions are randomly removed until
the rule has between 5 and 7 conditions (a randomly chosen number). If even after
this step the rule remains with a null coverage, the next criterion will be applied;

(b) If the number of conditions of a rule is less than or equal to 7, its conditions will be
dropped randomly one by one until the rule covers at least one record or all of its
conditions are dropped, removing the rule completely from the individual.
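
The two criteria can be sketched for a single rule as follows. The function name, the list-of-conditions representation and the `covers` callback (which reports whether a list of conditions covers at least one record) are hypothetical choices for this illustration.

```python
import random
from typing import Callable, List, Sequence


def prune_rule(
    conditions: Sequence[str],
    covers: Callable[[List[str]], bool],
    rng: random.Random,
) -> List[str]:
    """Prune a null-coverage rule; an empty result means the rule is removed."""
    conditions = list(conditions)
    if not conditions or covers(conditions):
        return conditions                      # only null-coverage rules are pruned

    # First criterion: shrink long rules to a random size between 5 and 7.
    if len(conditions) > 7:
        target = rng.randint(5, 7)
        while len(conditions) > target:
            conditions.pop(rng.randrange(len(conditions)))

    # Second criterion: drop conditions one by one until the rule covers a record.
    while conditions and not covers(conditions):
        conditions.pop(rng.randrange(len(conditions)))
    return conditions
```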

This operator is applied to an individual with a 20% probability. However, as an 
individual might be worsened by this operator, it is never applied to the best individ- 
ual of the current generation. The motivation to apply the above operator only to 20% 
of the individuals is to save processing time, since this is a relatively computationally- 
expensive operator. 

After the end of the evolution, the best individual also undergoes a different tree 
pruning. This operator removes two kinds of redundant rules: rules with a null cover- 
age and duplicate rules. This final tree pruning does not alter the fitness of the indi- 
vidual, since the removal of null-coverage/duplicate rules does not alter the set of 
examples covered by an individual’s rule set. 



2.3 The “Population” of Membership Functions 

As mentioned above, in our system the values of all continuous attributes are fuzzified 
into three linguistic values, namely low, medium, and high. These linguistic values
are defined by trapezoidal membership functions. Each continuous attribute is associ- 
ated with its own membership functions. Hence, the membership functions are dy- 
namically evolved, modifying a set of parameters defining the membership functions, 
to get better adapted to their corresponding attribute. All the parameters of all mem- 
bership functions are encoded into a single individual. This individual is considered as 
a “population” (in a loose sense of the term, of course) separated from the GP popula- 
tion. As mentioned in section 2.1, this single-individual population co-evolves with 
the GP population. 

2.3.1 Individual Representation 

The individual is divided into k parts (or “chromosomes”, loosely speaking), where k 
is the number of attributes being fuzzified. Each chromosome consists of four genes, 
denoted g1, g2, g3 and g4, which collectively define the three membership functions
(low, medium, and high) for the corresponding attribute, as shown in Fig. 2. Each 
gene represents an attribute value that is used to specify the coordinate of two trape- 
zoid vertices belonging to a pair of “adjacent” membership functions. The system 
ensures that g1 < g2 < g3 < g4.

This individual representation has two advantages. First, it reduces the search space 
of the evolutionary algorithm and saves processing time, since the number of parame- 
ters to be optimized by the evolutionary algorithm is reduced. Second, this representa- 
tion enforces some overlapping between “adjacent” membership functions and guaran- 
tees that, for each original value of the continuous attribute, the sum of its degrees of 
membership into the three linguistic values will be 1, which is intuitively sensible. 
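
One layout consistent with this description can be sketched as follows. The exact shape of the trapezoids is an assumption: low falls from 1 to 0 between g1 and g2, medium rises between g1 and g2 and falls between g3 and g4, and high rises between g3 and g4. The function names are illustrative.

```python
from typing import Tuple


def ramp(x: float, left: float, right: float) -> float:
    """Rise linearly from 0 at `left` to 1 at `right`, clamped outside that range."""
    if x <= left:
        return 0.0
    if x >= right:
        return 1.0
    return (x - left) / (right - left)


def memberships(x: float, g1: float, g2: float, g3: float, g4: float) -> Tuple[float, float, float]:
    """Return the (low, medium, high) degrees of x, assuming g1 < g2 < g3 < g4."""
    rise_medium = ramp(x, g1, g2)      # shared edge between low and medium
    rise_high = ramp(x, g3, g4)        # shared edge between medium and high
    low = 1.0 - rise_medium
    medium = rise_medium - rise_high
    high = rise_high
    return low, medium, high
```

Because adjacent functions share the same two vertices, each sloped region is split between exactly two linguistic values, so the three degrees always sum to 1.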

Fig. 2. Definition of 3 trapezoidal membership functions by 4 genes (g1, g2, g3, g4)

2.3.2 Evolutionary Algorithm to Evolve Membership Functions

Obviously, it is not possible to perform crossover in the single-individual “population” 
of membership functions. Therefore, the evolution of the single individual representing
membership functions is the result of a simple evolutionary algorithm, which
evolves by means of a $(\mu+\lambda)$-evolution strategy (more specifically, the (1+5)-strategy),
described as follows. 

First of all, the individual is cloned 5 times. Each clone is an exact copy of the 
original individual. Then the system applies to each clone a relatively high rate of 
mutation. Each chromosome (i.e. a block of four contiguous genes, g1, g2, g3 and g4,
defining the membership functions of a single attribute) has an 80% probability of 
undergoing a single-gene mutation. The mutation in question consists of adding or 
subtracting a small randomly-generated value to the current gene value. This has the 
effect of shifting the coordinate of the trapezoid vertices associated with that gene a 
little to the right or to the left. 

Note that, since a chromosome has four genes and only one of those genes is mu- 
tated, a mutation rate of 80% per chromosome corresponds to a mutation rate of 20% 
per gene. Our motivation to use this relatively high mutation rate is the desire to per- 
form a more global search in the space of candidate membership function definitions. 
If we used a much smaller mutation rate, say 1% or 0.1%, probably at most one gene 
of an entire individual (corresponding to all attributes being fuzzified) would be modi- 
fied. This would correspond to a kind of local search, where a new candidate solution 
being evaluated (via fitness function) would differ from its “parent” solution by only 
one gene, without taking into account gene interactions. By contrast, in our (1+5)-
evolution strategy scheme a new candidate solution being evaluated differs from its
“parent” solution by several genes, and the effect of all these gene modifications is 
evaluated as a whole, taking into account gene interactions. This is important, since 
the attributes being fuzzified can interact in such a way that modifications in their 
membership functions should be evaluated as a whole. Actually, the ability to take into 
account attribute interactions can be considered one of the main motivations for using 
an evolutionary algorithm, rather than a local search algorithm. 

In any case, once the 5 clones have undergone mutation, the 5 just-generated indi- 
viduals are evaluated according to a fitness function - which is discussed in the next 
subsection. The best individual is kept and all others are discarded. 

The number of clones (5) used in our experiments was empirically determined as a 
good trade-off between membership-function quality and processing time. 
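
One generation of this strategy can be sketched as follows. Several details are assumptions made for the sketch: the mutation step size (`max_step`), re-sorting a chromosome to restore the ordering of its genes, and the `keep_parent` switch, which lets the parent compete with its clones as in a classic (1+5) scheme. All names are illustrative.

```python
import random
from typing import Callable, List

Chromosome = List[float]          # [g1, g2, g3, g4] for one fuzzified attribute
Individual = List[Chromosome]     # one chromosome per fuzzified attribute


def mutate(
    individual: Individual,
    rng: random.Random,
    p_chromosome: float = 0.8,
    max_step: float = 0.05,
) -> Individual:
    """Clone the individual and mutate one gene in each selected chromosome."""
    clone = [list(chromosome) for chromosome in individual]
    for chromosome in clone:
        if rng.random() < p_chromosome:
            gene = rng.randrange(4)
            chromosome[gene] += rng.uniform(-max_step, max_step)
            chromosome.sort()     # assumption: a simple way to keep the genes ordered
    return clone


def es_step(
    parent: Individual,
    evaluate: Callable[[Individual], float],
    rng: random.Random,
    n_clones: int = 5,
    keep_parent: bool = True,
) -> Individual:
    """Generate mutated clones and return the best candidate."""
    candidates = [mutate(parent, rng) for _ in range(n_clones)]
    if keep_parent:
        candidates.append(parent)
    return max(candidates, key=evaluate)
```

With `keep_parent=False` only the five mutated clones are compared, which matches a literal reading of the description above.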

2.3.3 Fitness Function

Recall that the individual of the membership-function population represents defini- 
tions of membership functions to be used for defining rule antecedents being evolved 
by the GP population. Hence, the quality of the individual of the former population 
depends on the predictive accuracy of individuals of the latter population. More pre- 
cisely, in our co-evolutionary scheme the fitness value of the membership-function 
individual is computed as the sum of the fitness values of a group of individuals of the 
GP population. To compute the fitness, the system uses only a small portion of the GP 
population - for the experiments reported in this paper we used the best five individu- 
als - , in order to reduce processing time. 

2.4 Classifying New Examples 

Recall that a complete execution of our system generates one rule set for each class 
found in the data set. These rule sets are then used to classify the examples of the test 
set. For each test example the system computes the degree of membership of that 
example to each rule set (each one predicting a different class). Then the example is 
assigned the class of the rule set in which the example has the largest degree of mem- 
bership. The accuracy rate on the test set is computed as the number of correctly clas- 
sified test examples divided by the total number of test examples, as usual in the clas- 
sification literature. 
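
The classification step can be sketched as follows. Each rule set is represented here by a hypothetical function returning the membership degree of an example; the names are illustrative, and tie-breaking (the first class with the largest degree wins) is an assumption, since ties are not discussed above.

```python
from typing import Callable, Dict, Sequence, TypeVar

Example = TypeVar("Example")


def classify(example: Example, rule_sets: Dict[str, Callable[[Example], float]]) -> str:
    """Return the class whose rule set covers the example to the largest degree."""
    return max(rule_sets, key=lambda label: rule_sets[label](example))


def accuracy(
    examples: Sequence[Example],
    labels: Sequence[str],
    rule_sets: Dict[str, Callable[[Example], float]],
) -> float:
    """Fraction of test examples whose predicted class matches the true class."""
    correct = sum(classify(x, rule_sets) == y for x, y in zip(examples, labels))
    return correct / len(labels)
```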



3 Computational Results 

We have evaluated our system across four public-domain data sets from the UCI 
(University of California at Irvine) data set repository. These data sets are available 
from http://www.ics.uci.edu/~mlearn/MLRepository.html. Some of these data sets had 
a small number of records with unknown values. Since the current version of our sys- 
tem cannot cope with this problem, those records were removed. All the results re- 
ported below were produced by using a 10-fold cross-validation procedure [9]. 

In order to evaluate the performance of our system we have compared it to two 
other evolutionary systems found in the literature: ESIA [15] and BGP [18]. Both 
ESIA and BGP discover crisp rules. They were chosen for comparison because they 
have been applied to some of the data sets used in our experiments and because they 
have obtained good results in comparison with other data mining systems. The results 
for ESIA and BGP reported here are taken directly from the above-mentioned papers. 
The results for ESIA were also produced by 10-fold cross-validation, whereas the 
results for BGP were produced by generating 30 training and test sets. 

As can be seen in Table 1, our system and ESIA obtained the same accuracy rate on 
the Iris data set. (The numbers between brackets for our system are standard devia- 
tions.) On the other two data sets (CRX and Heart), our system considerably outper- 
forms ESIA. Our system outperforms BGP on the Iris data set, but BGP outperforms 
our system on the Ionosphere data set. 

A possible explanation for the lower performance of the fuzzy rules discovered by 
our system in the Ionosphere data set is suggested by the large number (34) of continuous
attributes in that data set. This suggests the possibility that the simple evolutionary
algorithm described in section 2.3 has difficulty in coping with such a relatively
high number of attributes being fuzzified. In other words, in this case the size of
the search space may be too large for such a simple evolutionary algorithm. This hy- 
pothesis will be further investigated in future work. 

Table 1. Accuracy rate (on test set), in %, of our system, ESIA and BGP

Data set          Our system     ESIA     BGP
CRX               84.7 (±3.5)    77.39    N/A
Heart (statlog)   82.2 (±7.1)    74.44    N/A
Ionosphere        88.6 (±6.0)    N/A      89.2
Iris              95.3 (±7.1)    95.33    94.1



Overall, we consider these results very promising, bearing in mind that, unlike 
ESIA and BGP, our system has the advantage of discovering fuzzy rules, which tend 
to be more intuitive for a user than the “hard” thresholds associated with continuous 
attributes in crisp rules. On the other hand, like most evolutionary algorithms, our co- 
evolutionary system needs a good amount of computational time to run. More pre- 
cisely, a single iteration of cross-validation took a processing time varying from a 
couple of minutes for the Iris data set to about one hour for the CRX data set - results 
obtained for a dual-processor Pentium II 350. Shorter processing times may be ob- 
tained by the use of parallel data mining techniques [8], but this point is left for future 
research. 



4 Conclusions and Future Research 

We have proposed a co-evolutionary system for discovering fuzzy classification rules. 
The system uses two evolutionary algorithms: a genetic programming (GP) algorithm 
evolving a population of fuzzy rule sets and a simple evolutionary algorithm evolving 
a population of membership function definitions. The two populations co-evolve, so 
that the final result of the co-evolutionary process is a fuzzy rule set and a set of mem- 
bership function definitions that are well adapted to each other. 

The main advantage of this co-evolutionary approach is that the fitness of a given 
set of membership function definitions is evaluated across several fuzzy rule sets, 
encoded into several different GP individuals, rather than on a single fuzzy set. This 
makes that evaluation more robust. In order to mitigate the problem of long processing 
times, our system evaluates a set of membership function definitions only across the 
few best GP individuals. 

In addition, our system also has some innovative ideas with respect to the encoding 
of GP individuals representing rule sets. The basic idea is that our individual encoding 
scheme incorporates several syntactical restrictions that facilitate the handling of rule 
sets in disjunctive normal form. We have also adapted GP operators to better work 
with the proposed individual encoding scheme. 

We have evaluated our system across four public domain data sets and compared it
with two other evolutionary systems (ESIA and BGP) found in the literature which 
used the same data sets. Our results can be summarized as follows: 

(a) Our co-evolutionary system considerably outperforms ESIA in two out of three 
datasets and equals it in the other data set, with respect to predictive accuracy. 

(b) Our system is competitive with BGP in two data sets. (In one data set our sys- 
tem outperforms BGP, whereas BGP outperforms our system in the other data set.) 

(c) Our system has the advantage of discovering fuzzy rules, which tend to be more 
intuitive for the user than the crisp rules discovered by ESIA and BGP. 

There are several directions for future research. For instance, the GP tree pruning
operator currently used in our system is a “blind” operator, in the sense that tree nodes 
to be pruned are randomly chosen. It seems that a promising research direction would 
be to design a more “intelligent” pruning operator, which would choose the tree nodes 
to be pruned based on some estimate of the predictive power of those tree nodes. 

Note that the above suggested research direction concerns improvement in the GP 
algorithm used by our system. However, it seems that the most important point to
investigate in future research is the performance of the simple evolutionary algorithm
for evolving membership function definitions. It is possible that the current version of 
this algorithm is not robust enough to cope with data sets having a large number of 
attributes being fuzzified. This hypothesis must be further investigated in the future, 
which might lead to improvements in the current version of this simple evolutionary 
algorithm. 



References 

1. W. Banzhaf, P. Nordin, R.E. Keller and F.D. Francone. Genetic Programming - an
Introduction. Morgan Kaufmann, 1998.

2. P.J. Bentley. “Evolutionary, my dear Watson” - investigating committee-based evolution 
of fuzzy rules for the detection of suspicious insurance claims. Proc. Genetic and Evolu- 
tionary Computation Conf. (GECCO-2000), 702-709. Morgan Kaufmann, 2000. 

3. L.A. Breslow and D.W. Aha. Simplifying decision trees: a survey. The Knowledge Engi- 
neering Review, 12(1), 1-40. Mar. 1997. 

4. M. Delgado, F.V. Zuben and F. Gomide. Modular and hierarchical evolutionary design of 
fuzzy systems. Proc. Genetic and Evolutionary Computation Conf. (GECCO-99), 180-187. 
Morgan Kaufmann, 1999. 

5. U.M. Fayyad, G. Piatetsky-Shapiro and P. Smyth. From data mining to knowledge discov- 
ery: an overview. In: U.M. Fayyad et al. (Eds.) Advances in Knowledge Discovery & Data 
Mining, 1-34. AAAI/MIT, 1996. 

6. C.S. Fertig, A.A. Freitas, L.V.R. Arruda and C. Kaestner. A Fuzzy Beam-Search Rule 
Induction Algorithm. Principles of Data Mining and Knowledge Discovery (Proc. 3rd 
European Conf - PKDD-99). Lecture Notes in Artificial Intelligence 1704, 341-347. 
Springer-Verlag, 1999.

7. A.A. Freitas. A survey of evolutionary algorithms for data mining and knowledge discov- 
ery. To appear in: A. Ghosh and S. Tsutsui. (Eds.) Advances in Evolutionary Computation. 
Springer-Verlag, 2001.

8. A.A. Freitas and S.H. Lavington. Mining Very Large Databases with Parallel Processing. 
Kluwer Academic Publishers, 1998. 







9. D.J. Hand. Construction and Assessment of Classification Rules. John Wiley & Sons, 1997.

10. H. Ishibuchi and T. Nakashima. Linguistic rule extraction by genetics-based machine 
learning. Proc. Genetic and Evolutionary Computation Conf. (GECCO-2000), 195-202. 
Morgan Kaufmann, 2000. 

11. H. Ishibuchi, T. Nakashima and T. Kuroda. A hybrid fuzzy GBML algorithm for designing 
compact fuzzy rule-based classification systems. Proc. 9th IEEE Int. Conf. Fuzzy Systems 
(FUZZ IEEE 2000), 706-711. San Antonio, TX, USA. May 2000.

12. C.Z. Janikow. A knowledge-intensive genetic algorithm for supervised learning. Machine 
Learning 13, 189-228. 1993. 

13. G.J. Klir and B. Yuan. Fuzzy Sets and Fuzzy Logic. Prentice-Hall, 1995. 

14. W.B. Langdon, T. Soule, R. Poli and J.A. Foster. The evolution of size and shape. In: L.
Spector, W.B. Langdon, U-M. O’Reilly and P.J. Angeline. (Eds.) Advances in Genetic 
Programming Volume 3, 163-190. MIT Press, 1999. 

15. J.J. Liu and J.T. Kwok. An Extended Genetic Rule Induction Algorithm. Proc. Congress 
on Evolutionary Computation (CEC-2000). La Jolla, CA, USA. July 2000. 

16. D.J. Montana. Strongly typed genetic programming. Evolutionary Computation 3(2), 199- 
230. 1995. 

17. C.A. Pena-Reyes and M. Sipper. Designing breast cancer diagnostic systems via a hybrid 
fuzzy-genetic methodology. Proc. 8th IEEE Int. Conf. Fuzzy Systems. 1999. 

18. S.E. Rouwhorst and A.P.Engelbrecht. Searching the Forest: Using Decision Tree as Build- 
ing Blocks for Evolutionary Search in Classification. Proc. Congress on Evolutionary 
Computation (CEC-2000), 633-638. La Jolla, CA, USA. July 2000. 

19. D. Walter and C.K. Mohan. ClaDia: a fuzzy classifier system for disease diagnosis. Proc. 
Congress on Evolutionary Computation (CEC-2000), 1429-1435. La Jolla, CA. 2000. 

20. M.L. Wong and K.S. Leung. Data Mining Using Grammar Based Genetic Programming 
and Applications. Kluwer, 2000. 

21. N. Xiong and L. Litz. Generating linguistic fuzzy rules for pattern classification with 
genetic algorithms. Principles of Data Mining and Knowledge Discovery (Proc. PKDD-99).
Lecture Notes in Artificial Intelligence 1704, 574-579. Springer-Verlag, 1999.




Sentence Filtering for Information Extraction in 
Genomics, a Classification Problem 

Claire Nedellec1, Mohamed Ould Abdel Vetah1,2 and Philippe Bessieres3



1 LRI UMR 8623 CNRS, Universite Paris-Sud, 91405 Orsay cedex
cn@lri.fr

2 ValiGen SA, Tour Neptune, 92086 La-Defense
ould@lri.fr

3 Mathematique, Informatique et Genome (MIG) INRA, 78026 Versailles cedex
philb@biotec.jouy.inra.fr



Abstract. In some domains, Information Extraction (IE) from texts requires
syntactic and semantic parsing. This analysis is computationally expensive and 
IE is potentially noisy if it applies to the whole set of documents when the rele- 
vant information is sparse. A preprocessing phase that selects the fragments 
which are potentially relevant increases the efficiency of the IE process. This 
phase has to be fast and based on a shallow description of the texts. We applied 
various classification methods — IVI, a Naive Bayes learner and C4.5 — to this 
fragment filtering task in the domain of functional genomics. This paper describes
the results of this study. We show that the IVI and Naive Bayes methods
with feature selection give the best results as compared with their results without
feature selection and with C4.5 results.



1 Introduction 

As an increasing amount of information becomes available in the form of electronic
documents, the need for intelligent text processing makes shallow text understanding
methods such as Information Extraction (IE) particularly useful. Up to now, IE has
been restrictively defined by DARPA's MUC (Message Understanding Conference)
program [10] as the task of extracting specific, well-defined types of information from
natural language texts in restricted domains, with the specific objective of filling
pre-defined template slots and databases. We claim that in many domains, IE systems have
to rely on deep analysis methods local to the relevant fragments. They should combine
the semantic-conceptual analysis of text understanding methods with information
extraction by pattern matching. In a first step, the relevant textual fragments are filtered
based on shallow criteria. In a second step, a representation of the content of the
fragments is built by successive interpretation operations based on a syntactico-semantic
lexicon, following a classical approach in text understanding. Finally, extraction rules
are applied to the resulting interpretations in order to identify the relevant information
and store it in a database in the suitable format, usually by filling forms in the MUC
case. These three steps differ in the nature of the knowledge that they exploit and in
the complexity of the methods applied. The second step, the syntactico-semantic
parsing, is the most expensive in terms of resources. The first step, the filtering of
the relevant fragments, limits that analysis to what is needed by focusing it on the
fragments that potentially contain relevant information. This selection is all the more
crucial when the information to be extracted is sparse. The sparseness problem has been
pointed out in previous research in IE [15, 16] but no practical solution has been
proposed. The main consequence is that the first step must be fast, even if this implies
some lack of precision. It must thus be based on a shallow description of the text. The
application of learning to the filtering of relevant fragments has received little
attention in IE compared to other tasks such as learning for named entity recognition or
learning extraction patterns [15, 16]. This lack of interest is due to the type of texts
that are generally handled by IE, namely those proposed in the MUC competition. Those
texts are usually short and the information to be extracted is generally dense, so that
prefiltering is needed less or not at all. The type of information to be extracted, such
as company names or seminar starting times, often requires only a shallow analysis, the
computational cost of which is low enough to make prefiltering unnecessary. This is not
the case in other IE tasks such as identifying gene interactions in functional genomics,
the application that we describe here.

From a Machine Learning point of view, filtering can be viewed as a classification 
problem. Textual fragments have to be classified in two classes: potentially relevant 
for IE or not. The learning examples represent fragments (sentences in this
application) and the example attributes are the significant, lemmatized words (in
canonical form) of the sentences. We compared experimentally the classification
method IVI proposed in [12] for IE in functional genomics, a Naive Bayes (NB) 
method [9], and a decision tree-based method, C4.5 [14], on three different datasets in 
functional genomics described in section 2. As a consequence of the example repre- 
sentation, the datasets are very sparse in the attribute space; the examples are de- 
scribed by few attributes. Thus, in addition to the basic methods, we studied the effect 
of feature selection as a preprocessing step. The objective of this study is to identify 
the best classification methods for filtering sentences in functional genomics and to 
characterize the corpora with respect to these methods. This paper reports our results 
on comparing classification methods. The methods and the evaluation protocol are 
detailed in section 3. Section 4 reports and discusses the experimental results. Future 
work is presented in section 5. 



2 The Application Domain: Functional Genomics

2.1 A Genomics Point of View on IE 

The problem to which we apply IE here is the modeling of gene interactions from
text, in the domain of functional genomics. This problem has been previously
described in [1, 12, 11, 18] among others. Numerous scientific and technical domains
share strong common aspects with functional genomics from a document point of view,
which will allow the methods developed here to be adapted to other application
domains. This is typically the case for related domains in biology, but more
generally, the methods will be transposable to and exploitable in any application of
knowledge extraction from scientific and technical documents.

Modeling interactions between genes is of significant interest for biologists, because it
is a prerequisite step towards understanding how the cell functions. To date, most
of the biological knowledge about these interactions is not described in databanks,
but only in the form of scientific summaries and articles. Therefore, their exploitation
is a major milestone towards building models of interactions between genes. Moreover,
genome research projects have generated new experimental approaches, like DNA
chips, at the level of whole organisms. A research team is now able to quickly
produce thousands of measurements. This very new context for biologists calls
for automatic extraction of knowledge from text, in order to interpret and make
sense of elementary measurements from the laboratory by linking them to the scientific
literature. The bibliographic databases can be searched via the Internet using keyword
queries that retrieve a superset of the relevant paper abstracts. For example, the query
"Bacillus subtilis transcription" related to the gene interaction topic retrieves 2209
abstracts.



Extract of a MedLine abstract on Bacillus subtilis. 

UI - 99175219 [ . . ] 

AB - [..] It is a critical regulator of cot genes encoding proteins 
that form the spore coat late in development. Most cot genes, and the 
gerE gene, are transcribed by sigmaK RNA polymerase. Previously, it 
was shown that the GerE protein inhibits transcription in vitro of the 
sigK gene encoding sigmaK. Here, we show that GerE binds near the sigK 
transcriptional start site, [..] 

The biologist then has to identify the relevant fragments (in bold-face in the example)
in the abstracts and to extract the useful knowledge with respect to the goal of
identifying gene interactions. This knowledge then has to be represented in a structured
way so that it can be recorded in a database for further querying and processing. The
more general goal is to identify all the interactions and molecular regulations and to
build a functional network.



Example of a form filled with the information extracted from the sentence in the example.

Interaction
    Type: negative
    Agent: GerE protein
    Target: Expression
        Source: sigK gene
        Product: sigmaK protein

This domain is representative of the scope of our study on automating the filtering of
relevant fragments for IE: the information to be extracted is local, mainly located in
single sentences or parts of sentences. It is very sparse in the document set. For
instance, only 2.5 % (470) of the 20000 sentences contain relevant information on gene
interaction in the 2209 Bacillus subtilis abstracts mentioned above. We contend that
the information extraction has to rely on a deep analysis. Indeed, previous approaches
based on shallow descriptions of the texts (e.g. IE techniques such as transducers
defined manually and based on significant verbs and gene names [1, 11, 18]) or on
statistical measures of keyword co-occurrences [12, 17] (e.g. information
retrieval-based techniques) yield limited results, with either a bad recall or a low
precision. The following example illustrates some of the problems encountered:

"GerE |stimulates| cotD transcription and |inhibits| cotA transcription in
vitro by sigma K RNA polymerase, as expected from in vivo studies, and,
unexpectedly, profoundly |inhibits| in vitro transcription of the gene
(sigK) that |encodes| sigma K."

The IE methods based on keywords or gene names (bold-face) and interaction verbs
(framed) are not able to identify the inhibition interaction between GerE and sigK
gene transcription (28 words apart) or, if they do, they also erroneously identify
interactions between cotD and sigK and between cotA and sigK. Extracting relevant
knowledge in the selected documents thus requires more complex IE methods such as
syntactico-semantic methods based on lexical and semantic resources specific to the
domain¹. The characteristics of this application thus perfectly fit the requirements for
applying classification methods for filtering relevant text as an IE preprocessing step.
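
The over-generation problem described above can be illustrated with a minimal sketch of a co-occurrence extractor. The gene name list, the verb list and the function name are hypothetical and restricted to this one sentence; this is not one of the cited systems, only an illustration of why pairing every two gene names around an interaction verb produces spurious interactions.

```python
from itertools import combinations

# Hypothetical resources, limited to the example sentence of this section.
GENE_NAMES = {"GerE", "cotD", "cotA", "sigK"}
INTERACTION_VERBS = {"stimulates", "inhibits"}

def cooccurrence_pairs(sentence):
    """Pair every two gene names of a sentence that contains an interaction verb."""
    cleaned = sentence.replace("(", " ").replace(")", " ").replace(",", " ")
    tokens = cleaned.split()
    if not any(tok in INTERACTION_VERBS for tok in tokens):
        return []
    genes = []
    for tok in tokens:
        if tok in GENE_NAMES and tok not in genes:
            genes.append(tok)  # keep first-mention order, no duplicates
    return list(combinations(genes, 2))

SENTENCE = ("GerE stimulates cotD transcription and inhibits cotA transcription in "
            "vitro by sigma K RNA polymerase, as expected from in vivo studies, and, "
            "unexpectedly, profoundly inhibits in vitro transcription of the gene "
            "(sigK) that encodes sigma K.")

pairs = cooccurrence_pairs(SENTENCE)
for agent, target in pairs:
    print(agent, "-", target)
```

The output contains the wanted GerE - sigK pair, but also cotD - sigK and cotA - sigK, which the sentence does not state.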



2.2 Textual Corpora and Learning Sets 

The robustness of the classification methods has been evaluated with respect to
different writing styles, different biological species, and thus different gene interaction
models. The classification methods chosen have been applied, evaluated and compared
on three different datasets. These sets have been built from paper abstracts
about three species: the first set, denoted Dro, is about a fly, Drosophila
melanogaster²; the second, denoted Bs, is about a bacterium, Bacillus subtilis³; and the
third, denoted HM, is about the mouse and the human⁴. They come from two
bibliographic databases with different writing styles. The Dro dataset is from FlyBase, the
database devoted to Drosophila genes. Its abstracts are concise, 2 or 3 sentences long,
with short sentences and quite simple syntax. The two others are from MedLine, the
generalist biology bibliographic database. The abstracts of MedLine are longer,
around 10 sentences, with more complex syntactic forms than those of FlyBase. The
abstracts have been selected by the query "Bacillus subtilis transcription" for the Bs
dataset and by the queries Telomere, Apoptose, DNA replication, DNA repair, cell cycle
control, two-hybrid and interaction for HM. The example sets have been selected from the
abstracts under the locality assumption that the sentence is the suitable level of
granularity in this IE application, as is often the case in Machine Learning for IE
applications [15, 16]. It is assumed that the potentially relevant sentences in the Bs
and HM sets contain at least two gene or protein names denoting the agents of the
interaction, as in previous work. In the Dro set as it has been provided to us, the
sentences contain exactly two gene or protein names. This difference should affect only
the extraction phase, not the filtering phase. The identification of gene names
for the Dro and HM sets has been done manually by LGPD-IBDM biologists.
This manual selection results in 530 abstracts for the Dro set, and 105 abstracts and 962
sentences for the HM set, which have been provided to us as such. This manual processing
affects the classification results, as will be shown in section 4. The sentence
selection for the Bs set has been done automatically with the help of a list of gene and
protein names of Bacillus subtilis and their derivations, provided by MIG and
manually completed with new derivations observed in the corpus. The problem of the
automatic identification of gene names in genomics documents has recently been
studied and recognized as a prerequisite for any further automatic document
processing, because of the lack of an exhaustive dictionary and because of the varying
notation [2, 5, 6, 13].

¹ This is the goal of the Caderige project of which this research is part.

² The Dro example set has been provided as such by B. Jacq and V. Pillet from LGPD-IBDM.

³ This set has been built by P. Bessières (MIG, INRA) in the Caderige project.

⁴ It has been provided as such by the LGPD-IBDM and the ValiGen company.
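
The automatic sentence selection used for the Bs set can be sketched as follows. This is a minimal sketch under stated assumptions: the name list below is a tiny hypothetical stand-in for the MIG list, names are matched as exact case-sensitive tokens, and derivations of names are not handled.

```python
import re

def count_gene_names(sentence, gene_names):
    """Count the distinct known gene or protein names mentioned in a sentence."""
    tokens = set(re.findall(r"[A-Za-z0-9]+", sentence))
    return len(tokens & gene_names)

def select_candidates(sentences, gene_names, minimum=2):
    """Keep the sentences that mention at least `minimum` distinct names."""
    return [s for s in sentences if count_gene_names(s, gene_names) >= minimum]

# Hypothetical name list; the real list comes from MIG and is much larger.
BS_NAMES = {"GerE", "gerE", "sigK", "sigmaK", "cotA", "cotD"}

sentences = [
    "Most cot genes, and the gerE gene, are transcribed by sigmaK RNA polymerase.",
    "It is a critical regulator of cot genes.",
    "The GerE protein inhibits transcription in vitro of the sigK gene encoding sigmaK.",
]

candidates = select_candidates(sentences, BS_NAMES)
```

Only the first and third sentences are kept here; the second mentions no listed name and would never reach the classifier.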



Table 1. Features of the example sets

                                        Dro          Bs                 HM
Document database                       FlyBase      MedLine            MedLine
# bibliographic references              > 100 000    around 16 million  around 16 million
# sentences per abstract                2 to 3       approximately 10   approximately 10
Species                                 Drosophila   Bacillus subtilis  mouse - human
# biblio. references to the species     20 300       15 213             4 067 879
# abstracts selected (queries)          20 300       2 209              32 448
# abstracts selected after manual step  530          Not relevant       105
# sentences in the abstracts            5 244        around 20 000      962
# sentences filtered (at least 2 gene
  names) = # examples                   1 197        932                407
# attributes                            1 701        2 340              1 789
# positive examples (PosEx)             653          470                240
# negative examples (NegEx)             544          462                167



Training example of the Bs dataset, built from the sentence that illustrates Sect. 2.1



Example : addition stimulate transcription inhibit transcription 
vitro RNA polymerase expected vivo study unexpectedly profoundly 
inhibit vitro transcription gene encode 
Class : Positive 



The attributes that describe the learning examples represent the significant,
lemmatized words of the sentences. They are boolean in the case of C4.5, and they represent
the number of occurrences in the sentence in the other cases, i.e., IVI and NB. The
examples have been classified into the positive and the negative categories, i.e.,
describing at least one interaction (positive) or none at all (negative). The HM and Bs
sentences have been lemmatized using the Xerox shallow parser. Stopwords such as
determiners have been removed as non-discriminant with the help of the list provided by
Patrice Bonhomme (LORIA). It initially contained 620 words and has been revised
with respect to the application. After stopword removal, the three example sets
remain very sparse in the feature space: half of the attributes describe a single example.
The capacity to deal with data sparseness was thus one of the criteria for choosing the
classification methods.
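
The two attribute encodings described above can be sketched as follows. This is only an illustration: the input is assumed to be already lemmatized (the Xerox parser is not called), and the stopword set is a small hypothetical stand-in for the revised 620-word list.

```python
from collections import Counter

# Hypothetical stopword subset; the actual list is the revised LORIA list.
STOPWORDS = {"the", "of", "in", "and", "that", "by", "as", "from"}

def to_counts(lemmas):
    """Occurrence-count attributes, the encoding used for IVI and NB."""
    return Counter(lemma for lemma in lemmas if lemma not in STOPWORDS)

def to_boolean(counts):
    """Presence attributes, the encoding used for C4.5."""
    return {attribute: True for attribute in counts}

lemmas = ("stimulate transcription and inhibit transcription in vitro by RNA "
          "polymerase as expected from in vivo study and unexpectedly profoundly "
          "inhibit in vitro transcription of the gene that encode").split()

counts = to_counts(lemmas)
presence = to_boolean(counts)
print(counts["transcription"], counts["inhibit"], presence["vitro"])
```

Any attribute absent from an example is implicitly 0 (or false), which is what makes the example sets so sparse.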
3 Classification Methods



3.1 Method Descriptions 



The classification method IVI has been applied to the Dro dataset [12]. It is based on the
example weight measure defined by (2), which is itself based on the attribute weight
measure defined by (1), where occ(Att_i, ex_j) represents the value (i.e., the number of
occurrences) of the attribute i for the example j. The class of the example is
determined with respect to a threshold experimentally set to 0. Examples with weights
above (resp. below) the threshold are classified as positive (resp. negative).

$$\mathrm{Weight}(Att_i) = \frac{\sum_{ex_j \in PosEx} occ(Att_i, ex_j) \;-\; \sum_{ex_j \in NegEx} occ(Att_i, ex_j)}{\sum_{ex_j \in Ex} occ(Att_i, ex_j)} \qquad (1)$$

$$\mathrm{IVI}(ex) = \sum_{i=1}^{|Att(ex)|} \mathrm{Weight}(Att_i) \qquad (2)$$
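
As a minimal sketch of the IVI scheme, assuming that the attribute weight is the difference between positive and negative occurrence counts divided by the total count, and that the example score sums the weights of the attributes present in the example. Two points are assumptions not fixed by the text: attributes unseen in training contribute 0, and a score exactly equal to the threshold is classified as negative. Function names and the toy data are hypothetical.

```python
from collections import Counter

def attribute_weights(pos_examples, neg_examples):
    """Weight(Att) = (occ in PosEx - occ in NegEx) / occ in all examples."""
    pos_occ, neg_occ = Counter(), Counter()
    for example in pos_examples:
        pos_occ.update(example)  # adds the occurrence counts
    for example in neg_examples:
        neg_occ.update(example)
    weights = {}
    for attribute in set(pos_occ) | set(neg_occ):
        total = pos_occ[attribute] + neg_occ[attribute]
        weights[attribute] = (pos_occ[attribute] - neg_occ[attribute]) / total
    return weights

def ivi_score(example, weights):
    """Sum of the weights of the attributes that describe the example."""
    return sum(weights.get(attribute, 0.0) for attribute in example)

def ivi_classify(example, weights, threshold=0.0):
    return "positive" if ivi_score(example, weights) > threshold else "negative"

# Toy training data: each example maps an attribute to its occurrence count.
pos = [{"inhibit": 2, "transcription": 1}, {"stimulate": 1, "transcription": 1}]
neg = [{"transcription": 1, "strain": 1}, {"strain": 2}]

weights = attribute_weights(pos, neg)
label = ivi_classify({"inhibit": 1, "transcription": 1}, weights)
```

Weights range from -1 (attribute seen only in negative examples) to +1 (seen only in positive examples), which is why a threshold of 0 is a natural starting point.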



The Naive Bayes method (NB), as defined by [9], seemed to be suitable for the problem
at hand because of the data sparseness in the attribute space. Like IVI, NB estimates
the probabilities for each attribute to describe positive examples and negative
examples with respect to the number of their occurrences in the training set. The probability
that a given example belongs to a given class is estimated by (4), the product of the
probability estimations of the example attributes, given the class. The example is
assigned to the class for which this probability is the highest.

$$\Pr(Att_j \mid Class_i) = \frac{1 + \sum_{ex_k \in Class_i} occ(Att_j, ex_k)}{|Class| + \sum_{l} \sum_{ex_k \in Class_i} occ(Att_l, ex_k)} \qquad (3)$$

$$\Pr(ex \mid Class_i) = \prod_{j=1}^{|Att(ex)|} \Pr(Att_j \mid Class_i) \qquad (4)$$



The Laplace law (3) yields better results here than the basic estimate
because its smoothing deals well with the data sparseness. The independence
assumption of the attributes is obviously not verified here, although previous work has
shown surprisingly good performance of NB despite this constraint [4]. The third
class of methods applied is C4.5 and C4.5Rules. Compared to NB and IVI, the
decision tree computed by C4.5 is more informative and explicit about the combinations of
attributes that denote interactions, and thus potentially about the phrases that could be
useful for further information extraction.
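
The NB estimate with Laplace smoothing can be sketched as follows. This is a sketch under explicit assumptions: the smoothed estimate adds 1 to the attribute count and the number of classes to the total count of the class, which is one reading of (3); the likelihood follows (4) with one factor per attribute of the example and no class prior; logarithms are used only to avoid numerical underflow. Function names and toy data are hypothetical.

```python
import math
from collections import Counter

def train_nb(examples_by_class):
    """examples_by_class maps a class name to a list of {attribute: count} dicts."""
    n_classes = len(examples_by_class)
    model = {}
    for cls, examples in examples_by_class.items():
        occ = Counter()
        for example in examples:
            occ.update(example)
        # Assumed reading of (3): denominator = total occurrences + number of classes.
        model[cls] = (occ, sum(occ.values()) + n_classes)
    return model

def log_likelihood(example, cls, model):
    """log Pr(ex | class): one smoothed factor per attribute of the example."""
    occ, denominator = model[cls]
    return sum(math.log((occ[att] + 1) / denominator) for att in example)

def classify_nb(example, model):
    return max(model, key=lambda cls: log_likelihood(example, cls, model))

training = {
    "positive": [{"inhibit": 2, "transcription": 1},
                 {"stimulate": 1, "transcription": 1}],
    "negative": [{"transcription": 1, "strain": 1}, {"strain": 2}],
}
model = train_nb(training)
label = classify_nb({"inhibit": 1, "transcription": 1}, model)
```

Thanks to the added 1, an attribute never seen with a class (here "inhibit" in the negative class) gets a small non-zero probability instead of zeroing the whole product.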



3.2 Feature Selection 

The data sparseness is potentially a drawback for C4.5. Feature selection appears here
as a good way to filter the most relevant attributes, both for improving classification [19]
and for selecting the suitable corpus for other IE preprocessing tasks such as
semantic class learning (Sect. 5). This latter goal has motivated the choice of a filter
method for feature selection instead of a wrapper method [7], where the
classification algorithms would be repeatedly applied and evaluated on attribute
subsets in order to identify the best subset and the best classifier at the same time [8]. The
measure of attribute relevance used here is based on (5). It measures the capacity of
each attribute to characterize a class, independently of the other attributes and of the
classification method. The attributes are all ranked according to this measure and the
best of them are selected for describing the training sets (Sect. 4).



$$\mathrm{DiscrimP}(Att) = \frac{\sum_{i=1}^{|Class|} \mathrm{Max}\bigl(\Pr(Att, Cl_i),\; 1 - \Pr(Att, Cl_i)\bigr)}{|Class|} \qquad (5)$$
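
The ranking step can be sketched as follows, reading (5) as the average over classes of Max(Pr(Att, Cl), 1 - Pr(Att, Cl)). The text does not say how Pr(Att, Cl) is estimated; as an explicit assumption, the sketch uses the share of the attribute's occurrences that fall in the class. Function names, the alphabetical tie-breaking and the toy counts are hypothetical.

```python
from collections import Counter

def discrim_p(attribute, occ_by_class):
    """Average over classes of max(p, 1 - p), with p the assumed estimate of Pr(Att, Cl)."""
    total = sum(occ[attribute] for occ in occ_by_class.values())
    if total == 0:
        return 0.0
    score = 0.0
    for occ in occ_by_class.values():
        p = occ[attribute] / total  # assumption: share of occurrences in this class
        score += max(p, 1.0 - p)
    return score / len(occ_by_class)

def rank_attributes(occ_by_class):
    """Attributes sorted by decreasing relevance; ties broken alphabetically."""
    attributes = set().union(*occ_by_class.values())
    return sorted(attributes, key=lambda a: (-discrim_p(a, occ_by_class), a))

def select_best(occ_by_class, n):
    return rank_attributes(occ_by_class)[:n]

occ_by_class = {
    "positive": Counter(inhibit=2, transcription=2, stimulate=1),
    "negative": Counter(transcription=1, strain=3),
}
best = select_best(occ_by_class, 3)
```

Under this particular estimate, an attribute that occurs in a single class reaches the maximal score whatever its frequency, so very rare attributes rank high; a different estimate of Pr(Att, Cl) would change the ranking.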



3.3 Evaluation Metrics 



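The methods have been evaluated and compared with the usual criteria, that is, recall
(6), precision (7), and the F-measure (8), computed for the three datasets.

$$\mathrm{Recall}(Class_i) = \frac{|\{Ex \in Class_i \text{ and assigned to } Class_i\}|}{|\{Ex \in Class_i\}|} \qquad (6)$$

$$\mathrm{Precision}(Class_i) = \frac{|\{Ex \in Class_i \text{ and assigned to } Class_i\}|}{|\{Ex \text{ classified in } Class_i\}|} \qquad (7)$$

$$F = \frac{(\beta^2 + 1) \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\beta^2 \cdot \mathrm{Precision} + \mathrm{Recall}} \qquad (8)$$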

More attention is given to the results obtained for the positive class because
only the examples classified as positive will be transferred to the IE component. The
recall rate for this class should therefore be high, even if this implies some lack of
precision. The factor β of the F-measure has been experimentally set to 1.65 in order to
favor recall. IVI and NB have been evaluated by leave-one-out on each dataset.
For performance reasons, C4.5 and C4.5Rules have only been trained on 90 % of the
learning sets and tested on the remaining 10 %. The results presented here are
computed as the average of the test results for ten independent partitions.
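
The three criteria can be sketched as follows, with the F-measure written as (β² + 1) · Precision · Recall / (β² · Precision + Recall) and β defaulting to the value 1.65 used in this study. The function names and the argument convention (counts of true positives, false negatives and false positives for one class) are choices of this sketch. The usage lines take the NB recall and precision for the positive class on HM from Table 2.

```python
def recall(true_pos, false_neg):
    """Share of the examples of the class that are assigned to it."""
    return true_pos / (true_pos + false_neg)

def precision(true_pos, false_pos):
    """Share of the examples assigned to the class that belong to it."""
    return true_pos / (true_pos + false_pos)

def f_measure(prec, rec, beta=1.65):
    """F-measure; beta > 1 gives more weight to recall than to precision."""
    beta_sq = beta * beta
    return (beta_sq + 1) * prec * rec / (beta_sq * prec + rec)

# NB on HM, positive class (Table 2): recall 97.1 %, precision 68.5 %.
f_recall_oriented = f_measure(0.685, 0.971)
f_balanced = f_measure(0.685, 0.971, beta=1.0)
print(f_recall_oriented > f_balanced)
```

When recall exceeds precision, as here, setting β above 1 raises the F-measure relative to the balanced version, which is the intended effect of favoring recall.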



4 Evaluation 

4.1 Comparison of the IVI, C4.5, and NB Methods

The first experiments allow the comparison of C4.5, C4.5Rules, NB and IVI on the
three datasets (Table 2). Since recall and precision computed over the two classes yield
the same rate, they appear on a single line. NB has been applied here with the Laplace law.
In the three cases, NB and IVI results are better than C4.5 and C4.5Rules results. This
can be explained by the sparseness and the heterogeneity of the data. The global
precision rate is 5 to 8 % higher and the precision rate for the positive class is 4 to 12 %
higher. However, the good behavior of the IVI-NB family is not confirmed by the recall
rate for the positive class on the Dro dataset: the C4.5 recall rate is better than that of NB
and IVI on this set (13 %) but worse on the Bs and HM sets (-12 to -13 %). The origin of the
Dro dataset could explain these results: it comes from FlyBase, where the sentences are
much shorter than those of MedLine, from which Bs and HM are extracted. Thus Dro
examples are described by fewer attributes, although the ratio of the number of attributes
to the number of examples is similar to that of Bs. This could explain the overgenerality
of the C4.5 results on the Dro set, illustrated by the high recall and poor precision rates.
The analysis of NB and IVI results shows that NB behaves slightly better at a global level.



Table 2. Comparison of C4.5, C4.5Rules, IVI and NB on the three datasets (rates in %)

Corpus  Method     Recall Positive  Precision Positive  Recall-precision for all
Dro     C4.5       88.9 ± 2.4       68.1 ± 3.6          72 ± 2.5
Dro     C4.5Rules  86.8 ± 2.6       70.5 ± 3.5          73.6 ± 2.5
Dro     NB         75.3 ± 2.9       82 ± 3.2            77.5 ± 2.4
Dro     IVI        69.1 ± 3.5       83.1 ± 2.8          75.4 ± 2.4
Bs      C4.5       63.9 ± 4.3       63.4 ± 4.3          62.4 ± 3.1
Bs      C4.5Rules  71.4 ± 4.1       62.8 ± 4.4          62.9 ± 3.1
Bs      NB         85.7 ± 3.2       66.6 ± 4.3          71.1 ± 2.9
Bs      IVI        82.6 ± 3.4       67.4 ± 4.2          71 ± 2.9
HM      C4.5       88.3 ± 4.1       63.7 ± 6.1          63.7 ± 4.1
HM      C4.5Rules  84.5 ± 4.1       64.2 ± 6.1          63.4 ± 4.7
HM      NB         97.1 ± 2.1       68.5 ± 5.9          72 ± 4.4
HM      IVI        90 ± 3.8         70.3 ± 5.8          71.5 ± 4.4



However, their behaviors on the positive examples are very different: NB achieves a
higher recall than IVI (3 to 7 %) while IVI achieves a better precision than NB (1 to 2
%), a smaller difference. The higher recall and precision rates for the positive class on
HM compared to Bs are explained by the way the HM set has been built. The selection
of the sentences in the abstracts has been done manually by the biologists among a
huge number of candidate sentences (Table 1), and the bias of the choice could explain
the homogeneity of this dataset compared to Bs, which has been selected
automatically. This hypothesis has been confirmed by further experiments on the reusability
of the classifiers learned from one corpus and tested on others. As a better recall is
preferred in our application, the conclusion of these experiments is that NB should be
preferred for data from MedLine (Bs and HM), while for FlyBase (Dro) it would
depend on how well the IE component can deal with sentences filtered
with a low precision: C4.5 should be chosen if the best recall is preferable, while NB
should be chosen for its better recall-precision tradeoff.



4.2 Feature Selection 

As described in Sect. 3, the attributes for each dataset have been ranked according to
their relevance. For instance, the best attributes for the Dro set are downstream,
interact, modulate, autoregulate, and eliminate. The effect of feature
selection on the learning results of the IVI, NB and C4.5Rules methods has been
evaluated by selecting the best n attributes, with n varying from one hundred to the total
number of attributes, by increments of one hundred.
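
The selection protocol can be sketched as a simple sweep over the ranked attribute list. This is a minimal sketch: `evaluate` is a hypothetical callback standing for one training and test run of a classifier restricted to the kept attributes, and including the full attribute set as the last point of the sweep is an assumption of the sketch.

```python
def selection_sizes(n_attributes, step=100):
    """Sizes n = step, 2*step, ... up to and including the total number of attributes."""
    sizes = list(range(step, n_attributes, step))
    sizes.append(n_attributes)
    return sizes

def sweep(ranked_attributes, evaluate, step=100):
    """Evaluate a classifier on the n best attributes for each selection size n."""
    results = {}
    for n in selection_sizes(len(ranked_attributes), step):
        kept = set(ranked_attributes[:n])
        results[n] = evaluate(kept)
    return results

# For the Dro set (1701 attributes) this gives 100, 200, ..., 1700, 1701.
dro_sizes = selection_sizes(1701)

# Toy run with placeholder attribute names and a dummy evaluation function.
ranked = ["att%d" % i for i in range(250)]
toy_results = sweep(ranked, evaluate=len)
```

Each point of the resulting curve corresponds to one value of n, which is how the plateau and the best selection level are read off in the following sections.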
4.2.1 Effect of Feature Selection on NB Results

For the three sets, the recall noticeably increases and the precision noticeably decreases
with the number of relevant attributes selected, as expected (Fig. 2, Fig. 3
and Fig. 4). The F-measure increases over the first quarter, more or less stabilizes on a
plateau over one half (slightly increasing, since recall is predominant over precision in our
setting of the F-measure, section 3), and then decreases over the last quarter or fifth, after a
small peak in the case of the Dro and Bs sets. According to the F-measure, the best attribute
selections in terms of the recall-precision compromise are thus at the end of the plateau,
around 3/4 to 4/5 of the total number of attributes. For the Dro set, it is around 1400
attributes and for the Bs set it is around 1900 attributes. One can notice that the recall for
positive examples for the Dro and Bs sets is 10 to 15 % higher than the global recall and
that the opposite holds for the precision, which is exactly what is desirable in our application.




Fig. 2. NB classification results after feature selection on Dro set. 




Fig. 3. NB classification results after feature selection on Bs set. 

For the HM set, this phenomenon is even more noticeable: the recall of the positive class
is very high, close to 100 %, and 20 % higher than the global recall (Fig. 4). Compared
to the other sets, the plateau is more horizontal between 400 and 1900 attributes after a
slight increase between 400 and 800, and there is no peak before the decrease. The
global recall-precision rate is stable between 800 and 1400 attributes and all points are
equivalent in this interval. This could be explained by the homogeneity of the HM dataset,
which affected the initial classification results in the same way (Sect. 4.1).

Fig. 4. NB classification results after feature selection on HM set.

Table 3 presents a summary of the results obtained with NB without feature selection and
with the best attribute selection. NB results are improved by feature selection. The gain
is very high for HM, around 10 %, lower for Bs (6-7 %), and 4-5 % for Dro.



Table 3. Comparison of NB results with the best feature selection level (rates in %)

Dataset  # attributes     Rec. Positive  Prec. Positive  Prec.-Rec. for all classes
Dro      all att. (1701)  75.3 ± 2.9     82 ± 3.2        77.5 ± 2.4
Dro      1400             79 ± 3.1       86.4 ± 2.6      Rec. 81.8 ± 2.2, Prec. 82.1 ± 2.2
Bs       all att. (2340)  85.7 ± 3.2     66.6 ± 4.3      71.1 ± 2.9
Bs       1800             90.8 ± 2.6     74.1 ± 4.00     Rec. 77.5 ± 2.7, Prec. 79.9 ± 2.6
HM       all att. (1789)  97.1 ± 2.1     68.5 ± 5.9      72 ± 4.4
HM       900-1300         99.6 ± 0.8     76.1 ± 5.4      Rec. 81.1 ± 3.8, Prec. 81.3 ± 3.8



4.2.2 Effect of Feature Selection on C4.5 and IVI Results 

Similar experiments have been done with C4.5. They are summarized in Table 4.



Table 4. Comparison of C4.5 results with the best feature selection level (rates in %)

Dataset  # attributes     Recall Pos.  Precision Pos.  Prec.-Recall for all
Dro      all att. (1701)  86.8 ± 2.6   70.5 ± 3.5      73.7 ± 2.5
Dro      1400             84.5 ± 2.8   75 ± 3.33       75.3 ± 2.4
Bs       all att. (2340)  71.4 ± 4.1   62.8 ± 4.4      62.9 ± 3.1
Bs       1600             70.1 ± 4.2   71.4 ± 4.13     71.1 ± 3
HM       all att. (1789)  84.5 ± 4.6   64.2 ± 6.1      63.4 ± 4.7
HM       1300             84.6 ± 4.6   78.8 ± 4.6      74.9 ± 5.2



The conclusions are similar to those for NB: feature selection improves the global
classification results for all sets. The global improvement is large for Bs and HM (9 %)
and smaller for Dro (1.6 %), for the same reasons related to the origin of the corpora as
previously pointed out.

The similar experiments done with IVI are summarized in Table 5. The improvement
is higher for IVI than for the two other methods. It ranges from approximately
+6 % for Dro and +10 % for Bs to +16 % for HM.
Table 5. Comparison of IVI results with the best feature selection level (rates in %)

Dataset  # attributes     Recall Pos.  Prec. Pos.   Prec.-Rec. for all
Dro      all att. (1701)  69 ± 3.5     83.6 ± 2.9   75.4 ± 2.4
Dro      1300             77.9 ± 3.2   88.4 ± 2.5   Rec. 81.9 ± 2.2, Prec. 84.1 ± 2.1
Bs       all att. (2340)  82.6 ± 3.42  67.4 ± 4.23  71 ± 2.9
Bs       1900             91.5 ± 2.5   78.3 ± 3.7   Rec. 82.8 ± 2.4, Prec. 83.2 ± 2.4
HM       all att. (1789)  90 ± 3.8     70.3 ± 5.8   71.5 ± 4.4
HM       1400             98.3 ± 1.6   83.4 ± 4.7   Rec. 87.5 ± 1.6, Prec. 87.5 ± 4.7



4.2.3 Conclusion on the Effect of Feature Selection on Classification 

The comparison between the experimental results of C4.5, NB and IVI for the best
feature selection shows that IVI globally behaves better than the two others. With
respect to the recall rate for the positive class, NB behaves slightly better than or
similarly to IVI (1 to 2 %), while IVI precision rates are better than those of NB (2 to 7 %).
Therefore, when a good positive recall is preferred, NB with feature selection should be
chosen for all datasets except those like Dro that are less sparse and more homogeneous,
for which C4.5 without feature selection is better. When the best recall-precision
compromise is preferred, IVI with feature selection should be applied.



5 Future Work 

This research focuses on the classification of sentences represented by their significant
and lemmatized words. With feature selection by prefiltering, the methods studied yield
global recall and precision rates higher than 80 % and high recall rates for the positive
class. Other criteria should be tested for selecting the attributes, such as
information gain and mutual information. Better results should also be obtained with
classification based on more global measures that would take into account the
dependency between the words which form significant noun phrases. For instance, the
results of the ongoing work at LIPN on the acquisition of terminology for gene
interaction should reduce both the number of attributes and their dependency. We also plan
to study the reduction of the number of attributes by replacing, in the examples, the
words by the concept (the semantic class) they belong to, as learnt from a biological
corpus. Moreover, classification should be improved by reducing the data
heterogeneity by pre-clustering the examples; one classifier would then be learned per example
cluster. From an IE point of view, the assumption that relevant sentences contain at
least two gene or protein names should be relaxed. The attribute ranking will be used
to identify automatically other potentially relevant sentences. Finally, learning
extraction rules requires semantic class acquisition. The attribute ranking will also be used
to select the most relevant syntagms in the training corpora for learning semantic
classes. Learning will thus focus on the potentially most relevant concepts with respect
to the extraction task.
Abstract. This paper introduces a new type of Self-Organizing Map
(SOM) for Text Categorization and Semantic Browsing. We propose a
“hyperbolic SOM” (HSOM) based on a regular tesselation of the
hyperbolic plane, which is a non-euclidean space characterized by constant
negative gaussian curvature. This approach is motivated by the
observation that hyperbolic spaces possess a geometry where the size of a
neighborhood around a point increases exponentially and therefore provides
more freedom to map a complex information space such as language into
spatial relations. These theoretical findings are supported by our
experiments, which show that hyperbolic SOMs can successfully be applied to
text categorization and yield results comparable to other state-of-the-art
methods. Furthermore we demonstrate that the HSOM is able to map
large text collections in a semantically meaningful way and therefore
allows a “semantic browsing” of text databases.



1 Introduction 

For many tasks of exploratory data analysis, the creation of Self-Organizing Maps
(SOM) for data visualization, as introduced by Kohonen more than a decade 
ago, has become a widely used tool in many fields 0. 

So far, the overwhelming majority of SOM approaches have taken it for 
granted to use (some subregion of) a flat space as their data model and, moti- 
vated by its convenience for visualization, have favored the (suitably discretized) 
euclidean plane as their chief “canvas” for the generated mappings (for a few no- 
table exceptions using tree- or hypercubical lattices see e. g. 0,la, Q]). 

However, even if our thinking is deeply entrenched with euclidean space, an 
obvious limiting factor is the rather restricted neighborhood that “fits” around 
a point on a euclidean 2d surface. Recently, it has been observed that a particular
type of non-euclidean spaces, the hyperbolic spaces that are characterized by
uniform negative curvature, are very well suited to overcome this limitation
since their geometry is such that the size of a neighborhood around a point
increases exponentially with its radius r (while in a D-dimensional euclidean space
the growth follows the much slower power law r^D). This exponential scaling
behavior fits very nicely with the scaling behavior within hierarchical, tree-like 



L. De Raedt and A. Siebes (Eds.): PKDD 2001, LNAI 2168, pp. 33S-y4^ 2001. 
(c) Springer-Verlag Berlin Heidelberg 2001 

structures, where the number of items r steps away from the root grows as $b^r$, 
where b is the (average) branching factor. This interesting property of hyperbolic 
spaces has been exploited for creating novel displays of large hierarchical 
structures that are more accessible to visual inspection than in previous 
approaches. 

Therefore, it appears very promising to use hyperbolic spaces also in con- 
junction with the SOM. The resulting hyperbolic SOMs (HSOMs) are based 
on a tesselation of the hyperbolic plane (or some higher-dimensional hyperbolic 
space) and their lattice neighborhood reflects the hyperbolic distance metric that 
is responsible for the non-intuitive properties of hyperbolic spaces. 

Since the notion of non-euclidean spaces may be unfamiliar to many readers, 
we first give a brief account of some basic properties of hyperbolic spaces that are 
exploited for hyperbolic SOMs. We then illustrate the properties of hyperbolic 
SOMs with computer experiments focusing on the field of text mining. 

2 Hyperbolic Spaces 

Surfaces that possess negative gaussian curvature locally resemble the shape of 
a “saddle”, i. e., the negative curvature shows up as a local bending into oppo- 
site normal directions, as we move on orthogonal lines along the surface. This 
may make it intuitively plausible that on such surfaces the area (and also the 
circumference) of a circular neighborhood around a point can grow faster than 
in the uncurved case. Requiring a constant negative curvature everywhere leads 
to a space known as the hyperbolic plane H2 (with analogous generalizations to 
higher dimensions). The geometry of H2 is a standard topic in Riemannian 
geometry, and the relationships for the area A and the 
circumference C of a circle of radius r are given by 

$$A = 4\pi \sinh^2(r/2), \qquad C = 2\pi \sinh(r). \tag{1}$$

These formulae exhibit the highly remarkable property that both quantities grow 
exponentially with the radius r (whereas in the limit $r \to 0$ the curvature becomes 
insignificant and we recover the familiar laws for flat $\mathbb{R}^2$). It is this property that 
was observed to make hyperbolic spaces extremely useful for accommodating 
hierarchical structures: their neighborhoods are in a sense “much larger” 
than in the non-curved euclidean (or in the even “smaller” positively curved) 
spaces. 
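
The growth behavior of Eq. (1) can be made concrete numerically. The following is a minimal sketch (the function names are ours, not from the paper) that compares area and circumference of a hyperbolic circle with their euclidean counterparts; for small r the ratios approach 1, for large r they grow exponentially.

```python
import math

def hyperbolic_circle(r):
    """Area and circumference of a circle of radius r in H2, following Eq. (1)."""
    area = 4.0 * math.pi * math.sinh(r / 2.0) ** 2
    circumference = 2.0 * math.pi * math.sinh(r)
    return area, circumference

def euclidean_circle(r):
    """Area and circumference of a circle of radius r in the flat plane."""
    return math.pi * r ** 2, 2.0 * math.pi * r

for r in (0.01, 1.0, 5.0, 10.0):
    ah, ch = hyperbolic_circle(r)
    ae, ce = euclidean_circle(r)
    print(f"r={r:5.2f}  A_hyp/A_euc={ah / ae:12.4f}  C_hyp/C_euc={ch / ce:12.4f}")
```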

To use this potential for the SOM, we must solve two problems: (i) we must 
find suitable discretization lattices on H2 to which we can “attach” the SOM 
prototype vectors, (ii) after having constructed the SOM, we must somehow 
project the (hyperbolic!) lattice into “flat space” in order to be able to inspect 
the generated maps. 

2.1 Projections of Hyperbolic Spaces 

To construct an isometric (i.e., distance preserving) embedding of the hyperbolic 
plane into a “flat” space, we may use a Minkowski space. In such a space, 
the squared distance between two points $(x, y, u)$ and $(x', y', u')$ is given by 

$$d^2 = (x - x')^2 + (y - y')^2 - (u - u')^2, \tag{2}$$

i.e., it ceases to be positive definite. Still, this is a space with zero curvature 
and its somewhat peculiar distance measure allows us to construct an isometric 
embedding of the hyperbolic plane H2, given by 

$$x = \sinh(\rho)\cos(\phi), \quad y = \sinh(\rho)\sin(\phi), \quad u = \cosh(\rho), \tag{3}$$

where $(\rho, \phi)$ are polar coordinates on the H2 (note the close analogy of (3) 
with the formulas for the embedding of a sphere by means of spherical polar 
coordinates in $\mathbb{R}^3$). Under this embedding, the hyperbolic plane appears as the 
surface M swept out by rotating the curve $u^2 = 1 + x^2 + y^2$ about the u-axis.¹ 
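
As a quick numerical check of this embedding, the sketch below (function names are illustrative, and the Minkowski form with signature (+, +, -) is assumed as the squared distance) verifies that an embedded point lies on the surface M and that a small radial step in H2 has the same length under the Minkowski distance.

```python
import math

def embed(rho, phi):
    """Map polar coordinates (rho, phi) on H2 to (x, y, u) as in Eq. (3)."""
    return (math.sinh(rho) * math.cos(phi),
            math.sinh(rho) * math.sin(phi),
            math.cosh(rho))

def minkowski_sq(p, q):
    """Squared Minkowski distance with signature (+, +, -); not positive definite in general."""
    return (p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2 - (p[2] - q[2]) ** 2

x, y, u = embed(1.3, 0.7)
print("u^2 - (1 + x^2 + y^2) =", u * u - (1.0 + x * x + y * y))

# A small radial step of length 1e-3 in H2 keeps (approximately) its length.
step = math.sqrt(minkowski_sq(embed(1.3, 0.7), embed(1.301, 0.7)))
print("embedded length of a 1e-3 step:", step)
```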

From this embedding, we can construct two further ones, the so-called Klein model 
and the Poincaré model (the latter will be used to visualize hyperbolic SOMs 
below). Both achieve a projection of the infinite H2 into the unit disk, however, at 
the price of distorting distances. The Klein model is obtained by projecting the 
points of M onto the plane u = 1 along rays passing through the origin O (see 
Fig. 1). Obviously, this projects all points of M into the “flat” unit disk 
$x^2 + y^2 < 1$ of $\mathbb{R}^2$ (e.g., onto the point B). The Poincaré model results if we 
add two further steps: first a perpendicular projection of the Klein model (e.g., a 
point B) onto the (“northern”) surface of the unit sphere centered at the origin 
(point C), and then a stereographic projection of the “northern” hemisphere 
onto the unit disk about the origin in the ground plane u = 0 (point D). It 
turns out that the resulting projection of H2 has a number of pleasant 
properties, among them the preservation of angles and the mapping of shortest paths 
onto circular arcs belonging to circles that intersect the unit disk at right angles. 
Distances in the original H2 are strongly distorted in its Poincaré (and also in 
the Klein) image (cf. Eq. (5)), however, in a rather useful way: the mapping 
exhibits a strong “fisheye” effect. The neighborhood of the H2 origin is mapped 
almost faithfully (up to a linear shrinkage factor of 2), while more distant 
regions become increasingly “squeezed”. Since asymptotically the radial distances 
and the circumference both grow according to the same exponential law, the 
squeezing is “conformal”, i.e., (sufficiently small) shapes painted onto H2 are 
not deformed, only their size shrinks with increasing distance from the origin. 
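
The two projections can be sketched in a few lines. This is an illustration under stated assumptions, not the authors' implementation: the function names are ours, and the stereographic projection is assumed to use the "south pole" (0, 0, -1) as its centre. Under these assumptions a point at hyperbolic radius rho lands at radius tanh(rho) in the Klein disk and tanh(rho/2) in the Poincaré disk, which reproduces the shrinkage factor of 2 near the origin.

```python
import math

def hyperboloid_point(rho, phi):
    """Point on the surface M for polar coordinates (rho, phi) on H2."""
    return (math.sinh(rho) * math.cos(phi),
            math.sinh(rho) * math.sin(phi),
            math.cosh(rho))

def klein(x, y, u):
    """Central projection from the origin onto the plane u = 1."""
    return x / u, y / u

def poincare(kx, ky):
    """Klein point -> unit sphere (perpendicular) -> plane u = 0 (stereographic)."""
    h = math.sqrt(max(0.0, 1.0 - kx * kx - ky * ky))  # height on the northern hemisphere
    return kx / (1.0 + h), ky / (1.0 + h)             # projection centre assumed at (0, 0, -1)

for rho in (0.1, 1.0, 3.0, 6.0):
    kx, ky = klein(*hyperboloid_point(rho, 0.0))
    px, py = poincare(kx, ky)
    print(f"rho={rho:4.1f}  Klein radius={kx:.5f}  Poincare radius={px:.5f}")
```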

¹ The alert reader may notice the absence of the previously described local saddle 
structure; this is a consequence of the use of a Minkowski metric for the embedding 
space, which is not completely compatible with our “euclidean” expectations. 




Fig. 1. Construction steps underlying Klein and Poincaré models of the 
space H2 

Fig. 2. Regular triangle tesselations of the hyperbolic plane, projected into the unit 
disk using the Poincaré mapping. The leftmost tesselation shows the case where the 
minimal number (n = 7) of equilateral triangles meet at each vertex and is best suited 
for the hyperbolic SOM, since tesselations for larger values of n (right: n = 10) lead 
to bigger triangles. In the Poincaré projection, only sides passing through the origin 
appear straight, all other sides appear as circular arcs, although in the original space 
all triangles are congruent. 



By translating the original H2, the fisheye-fovea can be moved to any other part 
of H2, which allows one to selectively zoom in on interesting portions of a map painted 
on H2 while still keeping a coarser view of its surrounding context. 
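
One common way to realize such a translation in the Poincaré disk is a Möbius transformation on complex coordinates. The text does not spell out a formula, so the map below is an assumption on our part (it is the standard disk isometry that sends a chosen focus point to the origin); the distance function is the usual Poincaré-disk distance and is included only to check that hyperbolic distances are preserved.

```python
import math

def translate(z, a):
    """Move the focus point a to the disk origin (assumed Moebius form)."""
    return (z - a) / (1.0 - a.conjugate() * z)

def poincare_distance(z, w):
    """Hyperbolic distance between two points of the Poincare disk."""
    return 2.0 * math.atanh(abs(z - w) / abs(1.0 - z * w.conjugate()))

focus = complex(0.6, 0.3)
nodes = [complex(0.0, 0.0), complex(0.62, 0.33), complex(-0.5, 0.1)]
moved = [translate(z, focus) for z in nodes]

for z, m in zip(nodes, moved):
    print(f"{z:.3f} -> {m:.3f}  (|m| = {abs(m):.3f})")
```

Points near the focus end up near the centre of the disk, where the fisheye distortion is smallest, while all pairwise hyperbolic distances stay unchanged.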

2.2 Tesselations of the Hyperbolic Plane 

To complete the set-up for a hyperbolic SOM we still need an equivalent of a 
regular grid in the hyperbolic plane. We use the following results: while the 
choices for tesselations with congruent polygons on the sphere and even in the 
plane such that each grid point is surrounded by the same number n of neighbors 
are severely limited (the only possible values for n being 3, 4, 5 on the sphere, and 
3, 4, 6 in the plane), there is an infinite set of choices for the hyperbolic plane. In 
the following, we will restrict ourselves to lattices consisting of equilateral 
triangles only. In this case, there is for each $n \geq 7$ a regular tesselation such that each 
vertex is surrounded by n congruent equilateral triangles. Figure 2 shows two 
example tesselations (for the minimal value of n = 7 and for n = 10), using the 
Poincaré model for their visualization. While in Fig. 2 these tesselations appear 
non-uniform, this is only due to the fisheye effect of the Poincaré projection. 
In the original H2, each tesselation triangle has the same size, and this can be 
checked by re-projecting any distant part of the tesselation into the center of the 
Poincaré disk, after which it looks identical (up to a possible rotation) to the 
center of Fig. 2. 
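
That the triangle size is fixed by n can be illustrated with a short calculation. The sketch below is not taken from the paper: it assumes curvature -1 and uses the hyperbolic law of cosines for angles together with the angle-defect formula for the area. If n equilateral triangles meet at each vertex, each interior angle is 2π/n, and side length and area follow from that angle alone.

```python
import math

def triangle_side(n):
    """Side length of the equilateral triangles when n of them meet at each vertex."""
    alpha = 2.0 * math.pi / n                      # interior angle of each triangle
    # Law of cosines for angles: cosh(s) = (cos a + cos^2 a) / sin^2 a = cos a / (1 - cos a)
    return math.acosh(math.cos(alpha) / (1.0 - math.cos(alpha)))

def triangle_area(n):
    """Area as the angle defect pi - 3 * alpha (curvature -1 assumed)."""
    return math.pi - 3.0 * (2.0 * math.pi / n)

for n in (7, 10):
    print(n, round(triangle_side(n), 4), round(triangle_area(n), 4))
```

Larger n gives larger triangles, which is consistent with n = 7 yielding the finest lattice.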

One way to generate these tesselations algorithmically is by repeated 
application of a suitable set of generators of their symmetry group to a (suitably 
sized, cf. below) “starting triangle”. 

3 Hyperbolic SOM Algorithm 

We now have all ingredients required for a “hyperbolic SOM”. In the following, 
we use the regular triangle tesselation with vertex order n = 7, which leads to 
the “finest” tesselation that is possible (in H2, the angles of a triangle uniquely 
determine its size). Using the construction scheme sketched in the previous 
section, we can organize the nodes of such a lattice as “rings” around an origin 
node (i.e., it is simplest to build approximately “circular” lattices). The number 
of nodes of such a lattice grows very rapidly (asymptotically exponentially) 
with the chosen lattice radius R (its number of rings). For instance, for n = 7, 
Table 1 shows the total number $N_R$ of nodes of the resulting regular hyperbolic 
lattices with different radii ranging from R = 1 to R = 10. Each lattice node r 
carries a prototype vector $w_r \in \mathbb{R}^D$ from some D-dimensional feature space (if 
we wish to make any non-standard assumptions about the metric structure of 
this space, we would build this into the distance metric that is used for 
determining the best-match node). The SOM is then formed in the usual way, e.g., 
in on-line mode by repeatedly determining the winner node s and adjusting all 
nodes $r \in N(s,t)$ in a radial lattice neighborhood $N(s,t)$ around s according to 
the familiar rule 

$$\Delta w_r = \eta\, h_{rs}\, (x - w_r) \tag{4}$$

with $h_{rs} = \exp(-d^2(r,s)/2\sigma^2)$. However, since we now work on a hyperbolic 
lattice, we have to determine both the neighborhood $N(s,t)$ and the (squared) 
node distance $d^2(r,s)$ according to the natural metric that is inherited by the 
hyperbolic lattice. 
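
A single on-line adaptation step of this kind might look as follows. This is a minimal sketch under our own assumptions: prototypes are stored in a dictionary keyed by node, and the lattice distances d(r, s) to the winner are supplied by the caller for the nodes inside the neighborhood; the names `som_update`, `eta` and `sigma` are illustrative.

```python
import math

def som_update(weights, dist_to_winner, x, eta, sigma):
    """Apply one adaptation step to all nodes in the neighborhood of the winner.

    weights:        dict node -> list of floats (prototype vector w_r)
    dist_to_winner: dict node -> lattice distance d(r, s), only for nodes in N(s, t)
    x:              input vector (list of floats)
    """
    for r, d in dist_to_winner.items():
        h = math.exp(-d * d / (2.0 * sigma * sigma))      # Gaussian neighborhood h_rs
        w = weights[r]
        weights[r] = [wi + eta * h * (xi - wi) for wi, xi in zip(w, x)]
    return weights

weights = {"s": [0.0, 0.0], "far": [0.0, 0.0]}
som_update(weights, {"s": 0.0, "far": 3.0}, [1.0, 2.0], eta=0.5, sigma=1.0)
print(weights)
```

The winner itself (distance 0) moves by the full fraction eta towards the input, while distant nodes are barely affected.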



Table 1. Node numbers $N_R$ of hyperbolic triangle lattices with vertex order 7 for 
different numbers R of “node rings” around the origin. 

R      1    2    3    4     5     6      7      8       9      10
N_R    8   29   85  232   617  1625   4264  11173   29261   76616
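
The node counts of Table 1 happen to follow a simple recurrence. This is an observation about the tabulated numbers, not a construction given in the text: if the k-th ring is assumed to hold a_k nodes with a_0 = 0, a_1 = 7 and a_{k+1} = 3 a_k - a_{k-1}, then one origin node plus the first R rings reproduces every entry of the table.

```python
def hyperbolic_lattice_sizes(max_rings):
    """Total node counts N_R for R = 1..max_rings (vertex order 7).

    Assumed ring-size recurrence: a_0 = 0, a_1 = 7, a_{k+1} = 3 * a_k - a_{k-1}.
    """
    sizes = []
    prev, ring, total = 0, 7, 1          # a_0, a_1, and the single origin node
    for _ in range(max_rings):
        total += ring
        sizes.append(total)
        prev, ring = ring, 3 * ring - prev
    return sizes

print(hyperbolic_lattice_sizes(10))
```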



The simplest way to do this is to keep with each node r a complex number 
$z_r$ to identify its position in the Poincaré model. The node distance is then given 
(using the Poincaré model) as 

$$d = 2\,\mathrm{arctanh}\left(\left|\frac{z_r - z_s}{1 - \bar{z}_s z_r}\right|\right). \tag{5}$$

The neighborhood $N(s,t)$ can be defined as the subset of nodes within a 
certain graph distance (which is chosen as a small multiple of the neighborhood 
radius $\sigma$) around s. 
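
With nodes stored as complex numbers, the node distance is a one-liner. The sketch below assumes the standard Poincaré-disk distance with the factor 2 arctanh; the function name is ours. As a sanity check, a node at disk radius tanh(rho/2) has hyperbolic distance rho from the origin.

```python
import math

def node_distance(z_r, z_s):
    """Hyperbolic distance between two lattice nodes given as complex disk coordinates."""
    ratio = abs(z_r - z_s) / abs(1.0 - z_r * z_s.conjugate())
    return 2.0 * math.atanh(ratio)

origin = complex(0.0, 0.0)
node = complex(math.tanh(0.75), 0.0)     # disk radius tanh(rho / 2) with rho = 1.5
print(node_distance(origin, node))
```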

Like the standard SOM, the hyperbolic SOM can become trapped in 
topological defects. Therefore, it is also important here to control the 
neighborhood radius $\sigma(t)$ from an initially large to a final small value (details on this 
and on some further means to optimize convergence are given elsewhere). 

4 Experiments 

Introductory experiments, in which several examples illustrate the favorable 
properties of the HSOM as compared to the “standard” euclidean SOM, have been 
reported elsewhere. 



4.1 Text Categorization 

While, as for the SOM, a very high classification accuracy is of 
secondary importance to visualization, a good classification performance is still 
important to obtain useful maps of text categories. With the ever growing amount 
of available information on the Internet, automatic text categorization based on 
machine learning techniques has become a key task where high-dimensional input 
spaces with few irrelevant features are involved. Here the goal is the 
assignment of natural language documents to a number of predefined categories (each 
document $d_j$ can belong to one, several or none of the categories $c_i$). Achieving 
a high classification accuracy is an important prerequisite for automating 
high-volume information organization and management tasks. 



Text Representation. In order to apply the HSOM to natural text 
categorization, we follow the widely used vector-space model of Information Retrieval 
(IR). We applied a word stemming algorithm² such that, for example, the words 
“retrieved”, “retrieval” and “retrieve” are mapped to the term “retrief”. The 
value $f_i$ of the feature vector $f(d_j)$ for document $d_j$ is then determined by the 
frequency with which term $t_i$ occurs in that document. Following standard 
practice, we choose a term frequency × inverse document frequency weighting 
scheme: 

$$f_i = tf(t_i, j)\,\log\left(\frac{N}{df(t_i)}\right), \tag{6}$$

where the term frequency $tf(t_i, j)$ denotes the number of times term $t_i$ occurs 
in $d_j$, N the number of documents in the training set and $df(t_i)$ the document 
frequency of $t_i$, i.e. the number of documents $t_i$ occurs in. Additionally, we built 
a stop list of the most and least frequent terms specific to the training set and 
omitted those from the feature vectors, since they have no descriptive function 
with respect to that text corpus. 
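
The weighting scheme can be sketched directly from these definitions. The code below is an illustration only: it assumes documents arrive as lists of already stemmed tokens with stop words removed, uses the natural logarithm, and stores each feature vector sparsely as a dictionary; the function name and the toy corpus are ours.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Return one sparse tf-idf vector (dict term -> weight) per document."""
    n_docs = len(docs)
    # Document frequency: number of documents in which each term occurs.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)                                  # raw term frequency
        vectors.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return vectors

docs = [["film", "plot", "film"], ["film", "actor"], ["plot", "music"]]
for vec in tfidf_vectors(docs):
    print(vec)
```

A term that occurs in every document receives weight 0, which matches the intuition that it carries no descriptive information for this corpus.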



HSOM Text Categorization. The HSOM can be utilised for text 
categorization in the following manner (Fig. 3). In a first step, the training set is used 
to adapt the weight vectors $w_r$ according to (4). During the second step, the 

² We applied the triestem function of the SMART system by G. Salton and C. Buckley 
(ftp://ftp.cs.cornell.edu/pub/smart/). 

[Figure 3, recoverable panel labels: (a) Step 1: Training (training collection); finding best match] 

Fig. 3. Text categorization with the HSOM: First the training set is used to build an 
internal model of the collection represented by the HSOM’s reference vectors. In (b) for 
each training document the winner nodes are labelled with the document’s category. 
These labels are used in the classification step (c) where an unknown document is 
“thrown” onto the map and labelled with the categories of its corresponding best 
match node. 



training set is mapped onto the HSOM lattice. To this end, for each training 
example $d_j$ its best match node s is determined such that 

$$|f(d_j) - w_s| \leq |f(d_j) - w_r| \quad \forall r, \tag{7}$$

where $f(d_j)$ denotes the feature vector of document $d_j$, as described above. After 
all examples have been presented to the net, each node is labelled with the union 
$U_r$ of all categories that belonged to the documents that were mapped to this 
node. A new, unknown text is then classified into the union $U_s$ of categories which 
are associated with its winner node s selected in the HSOM. In order to evaluate 
the HSOM’s categorization performance, we furthermore use $\cos(w_s, f(d_j))$ as 
a confidence measure for the classification result. 
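
The labelling and classification steps can be sketched as follows, assuming an already trained map whose prototypes are given as plain lists of floats and a euclidean norm for the best-match search. All function names and the tiny example are hypothetical; they only illustrate the procedure described above.

```python
import math

def best_match(weights, f):
    """Index of the prototype closest to feature vector f (euclidean distance)."""
    return min(range(len(weights)), key=lambda r: math.dist(weights[r], f))

def label_nodes(weights, train_vectors, train_categories):
    """Label each node with the union of categories of the documents mapped to it."""
    labels = [set() for _ in weights]
    for f, cats in zip(train_vectors, train_categories):
        labels[best_match(weights, f)] |= set(cats)
    return labels

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na > 0.0 and nb > 0.0 else 0.0

def classify(weights, labels, f):
    """Return the category set of the winner node and the cosine confidence."""
    s = best_match(weights, f)
    return labels[s], cosine(weights[s], f)

weights = [[1.0, 0.0], [0.0, 1.0]]
train_vectors = [[0.9, 0.1], [0.8, 0.0], [0.1, 0.7]]
train_categories = [["Drama"], ["Drama", "Romance"], ["Action"]]
labels = label_nodes(weights, train_vectors, train_categories)
print(classify(weights, labels, [0.2, 0.9]))
```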



Text Collection. The text collection consists of movie reviews taken from the 
rec.arts.movies.reviews newsgroup. Genre information from the Internet Movie 
Database (http://www.imdb.com) was used to build a joint database containing 
the review texts plus the genres of their corresponding movies as the 
categories. To build the training text collection, for each of the 17 most prominent 
categories 20 movies were randomly selected. For each of these movies, 3 review 
texts were chosen at random. Therefore, the training collection contained 1020 
distinct documents. The test text collection was constructed in the same manner 
with the restriction that it must not contain any document of the training set. 
After word stemming and stop word removal we arrived at approximately 5000 
distinct terms for the construction of the feature vectors. 



Performance Evaluation. The classification effectiveness is commonly 
measured in terms of precision P and recall R, which can be estimated as 

$$P_i = \frac{TP_i}{TP_i + FP_i}, \qquad R_i = \frac{TP_i}{TP_i + FN_i}, \tag{8}$$

where $TP_i$ and $TN_i$ are the numbers of documents correctly classified, and 
correctly not classified to $c_i$, respectively. Analogously, $FP_i$ and $FN_i$ are the numbers 
of documents wrongly classified and wrongly not classified to $c_i$, respectively. By 
adjusting a threshold which is compared with the confidence value $\cos(w_s, f(d_j))$ of 
the classifier, the number of retrieved documents can be controlled. In order to 
obtain an overall performance measure for all categories, we applied the 
microaveraging method. Furthermore, the breakeven point of precision and recall, 
i.e. the value at which P = R, is a frequently given single number to measure 
the effectiveness determined by both values P and R. 
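
Precision, recall and their microaverage can be computed from per-category counts as sketched below. The microaveraging shown here is the common definition (pool the counts over all categories, then compute P and R once), which we assume is what is meant; the names and the example counts are made up for illustration.

```python
def precision_recall(tp, fp, fn):
    """Precision and recall from true positives, false positives, false negatives."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

def micro_average(counts):
    """counts: dict category -> (TP, FP, FN); pools the counts over all categories."""
    tp = sum(c[0] for c in counts.values())
    fp = sum(c[1] for c in counts.values())
    fn = sum(c[2] for c in counts.values())
    return precision_recall(tp, fp, fn)

counts = {"Drama": (8, 2, 2), "Horror": (1, 3, 1)}
for category, (tp, fp, fn) in counts.items():
    print(category, precision_recall(tp, fp, fn))
print("micro-average", micro_average(counts))
```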

In order to assess the HSOM’s performance for text categorization, we have 
used a k-nearest neighbour (k-NN) classifier, which was found to show very good 
results on text categorization tasks. Apart from boosting methods, only 
support vector machines have shown better performance. The confidence 
level of a k-NN classifier to assign document $d_j$ to class $c_i$ is 

$$\sum_{d_z \in TR_k(d_j)} a_{iz} \cdot \cos(d_j, d_z), \tag{9}$$

where $TR_k(d_j)$ is the set of k documents $d_z$ for which $\cos(d_j, d_z)$ is maximum. 
The assignment factor $a_{iz}$ is 1 if $d_z$ belongs to category $c_i$ and 0 otherwise. 
Following earlier work, we have chosen the k = 30 nearest neighbours. 
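
The k-NN confidence can be sketched as a sum of cosine similarities over those of the k most similar training documents that carry the category in question. The function names, the dense list representation of documents and the toy data are assumptions made for this illustration.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na > 0.0 and nb > 0.0 else 0.0

def knn_confidence(d_j, train_vectors, train_categories, category, k=30):
    """Sum of cos(d_j, d_z) over the k nearest training documents that belong to `category`."""
    sims = sorted(
        ((cosine(d_j, d_z), cats) for d_z, cats in zip(train_vectors, train_categories)),
        key=lambda pair: pair[0],
        reverse=True,
    )
    return sum(sim for sim, cats in sims[:k] if category in cats)

train_vectors = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
train_categories = [{"A"}, {"B"}, {"A"}]
print(knn_confidence([1.0, 0.0], train_vectors, train_categories, "A", k=2))
print(knn_confidence([1.0, 0.0], train_vectors, train_categories, "B", k=2))
```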



Text Categorization Results. Precision-recall diagrams for three categories 
and the microaveraged diagrams for all categories are shown in Fig. 4. The single 
category and microaveraged break-even points are laid out in Table 2. 

It is notable that the HSOM performs significantly worse if only a few 
documents are recalled, but the precision in cases of high recall values is very close 
to that of the k-NN classifier. Since one is usually interested in high precision 
in conjunction with high recall, the suboptimal results for low recall values do 
not really affect the usefulness of the HSOM for the purpose of text 
categorization³. Thus, our results indicate that the HSOM does not perform better than 
a k-NN classifier, but it does not perform significantly worse either. Since the main 
purpose of the HSOM is the visualization of relationships between texts and 
text categories, we believe that the observed categorization performance of the 
HSOM compares sufficiently well with the more specialized (non-visualization) 
approaches to warrant its efficient use for creating insightful maps of large bodies 
of document data. 

³ We also believe that a more clever heuristic than the simple distance to the best-match 
node in order to determine the evidence value of a classification will further improve 
accuracy for low retrieval rates. 




(a) k-NN    (b) HSOM    (c) Microaveraged 

Fig. 4. Precision-recall diagrams for the three categories Drama, Thriller and 
Romance. (a) shows the results for the k-NN classifier, (b) for the HSOM. In (c) the 
microaveraged diagrams for both methods are shown. 



Table 2. Precision-recall breakeven points for the most prominent categories. In most 
cases the k-NN performs better than the HSOM, but for the categories “Animation”, 
“Fantasy” and “Musical” the HSOM yields better results. 

        Action  Advent.  Animation  Comedy  Crime  Docum.  Drama
HSOM      81.6     75.4       86.9    81.3   84.5    86.7   82.5
k-NN      87.3     83.0       84.5    87.6   90.5    98.0   85.8

        Fantasy  Horror  Musical  Mystery  Romance  Sci-Fi  Thriller  M-avg.
HSOM       81.6    78.6     82.5     84.6     82.8    76.2      86.8    81.1
k-NN       75.0    88.9     81.2     86.1     87.8    89.3      89.1    86.4





4.2 Semantic Browsing 

A major advantage of the HSOM is its remarkable capability to map 
high-dimensional similarity relationships to a low-dimensional space which can be 
more easily handled and interpreted by the human observer. This feature and 
the particular “fisheye” capability motivate our approach to visualize whole 
text collections with the HSOM. With as few as 5 rings (cf. Table 1), we 
can handle well over 500 prototype vectors which are able to represent different 
types of texts. The nodes are labelled with those document titles which resemble 
their prototype vectors most closely. We additionally map symbols to the nodes 
which correspond to the categories associated with the prototypes. We can now 
interactively change the visual focus to those regions which show an interesting 
structure. In Fig. 5(a), for example, we have “zoomed” into a region of the map 
which indicated a cluster of “Animation” films. As a closer inspection shows, 
this region of the map contains movie reviews all connected to Disney’s typical 
animations released during Christmas time. In Fig. 5(b) the focal view was 
moved to a region connected to “Action” and “Fantasy” films. It not only 
shows the movies of the “Batman” series in a limited area of the map, but also 
a “Zorro” movie in the neighborhood, which makes a lot of sense, as the main 
characters of the films indeed have a semantic relation. 





Fig. 5. By moving the visual focus of the HSOM, a large text collection can be browsed 
quite elegantly. In this example, the movie titles of their corresponding review texts 
(there might be several reviews for one movie) are mapped. In (a) reviews of Disney’s 
animations have been moved into the centre, (b) shows a region of the map containing 
the “Batman” movies. Note that the HSOM also mirrors the semantic connection 
between “Zorro” and “Batman”. 



As illustrated in Fig. 6, the HSOM might also be used to classify an unknown text by 
displaying its relationship to a previously acquired document collection. In this 
example an unknown review text for the film “Jurassic Park” was mapped. The map 
was then zoomed to that node which most closely resembled the given input text, 
which in this case was another review describing the same film. Note that the 
neighborhood is occupied by reviews describing the sequel “The Lost World” and the 
“Dinosaurs” animation, respectively. By mapping a complete unknown document 
collection to a previously formed HSOM, relevant text documents can be discovered. 
In Fig. 7 the HSOM is used as a filter to display only those documents which belong 
to a semantic region of interest. 




Fig. 6. Mapping a new text. 

Fig. 7. Discovery of relevant documents in whole collections. In this example an 
unknown text collection is mapped onto the HSOM, but only those items are visualized 
which belong to a selected category, in this case animation movies. The central view in 
(a) points to a set of document clusters, which can be interactively zoomed into. In (b) 
the largest cluster has been moved into focus. It mainly shows Disney animations and 
films from the “Toy Story” series a bit farther down. In (c) the cluster in the top left 
contains “A Bug’s Life” and “Antz”, whereas the cluster in the bottom right belongs 
to movies connected to Japanese animations. 



5 Discussion 

Our results show that the HSOM is not only applicable to automated text 
categorization, but also specifically well suited to support “semantic browsing” in 
large document collections. Our first experiments indicate that the HSOM is able 
to exploit the peculiar geometric properties of hyperbolic space to successfully 
compress complex semantic relationships between text documents onto a 
two-dimensional projection space. Additionally, the use of a hyperbolic lattice 
topology for the arrangement of the HSOM nodes offers new and highly attractive 
features for interactive navigation in this way. Large document databases can 
be inspected at a glance while the HSOM provides additional information which 
was captured during a previous training step, allowing one, e.g., to rapidly visualize 
relationships between new documents and previously acquired collections. 

Future work will address more sophisticated visualization strategies based on 
the new approach, as well as the evaluation on other widely used text collections, 
such as Reuters, Ohsumed or Pubmed. 



Abstract. This paper discusses the clustering quality and complexities of the 
hierarchical data clustering algorithm based on gravity theory. The gravity-based 
clustering algorithm simulates how the given N nodes in a K-dimensional 
continuous vector space will cluster due to the gravity force, provided that each 
node is associated with a mass. One of the main issues studied in this paper is 
how the order of the distance term in the denominator of the gravity force 
formula impacts clustering quality. The study reveals that, among the hierarchical 
clustering algorithms invoked for comparison, only the gravity-based algorithm 
with a high order of the distance term neither has a bias towards spherical 
clusters nor suffers from the well-known chaining effect. Since bias towards spherical 
clusters and the chaining effect are two major problems with respect to 
clustering quality, eliminating both implies that high clustering quality is achieved. 
As far as time complexity and space complexity are concerned, the gravity-based 
algorithm enjoys either lower time complexity or lower space complexity 
when compared with the most well-known hierarchical data clustering 
algorithms except single-link. 

Keywords: data clustering, agglomerative hierarchical clustering, gravity force. 



1 Introduction 

Data clustering is one of the most traditional and important issues in computer science 
[4, 7, 9, 10]. In recent years, due to emerging applications such as data mining and 
document clustering, data clustering has attracted a new round of attention in 
computer science research communities [3, 5, 6, 11, 14, 17, 19]. One traditional 
taxonomy of data clustering algorithms that work on data points in a K-dimensional 
continuous vector space is based on whether the algorithm yields a hierarchical clustering 
dendrogram or not [10]. One major advantage of the hierarchical clustering 
algorithms is that a hierarchical dendrogram is generated. This feature is very important 
for applications such as biological, social, and behavioral studies, due to the need to 
construct taxonomies [9]. Furthermore, as Jain, Murty, and Flynn summarized [10], 
hierarchical clustering algorithms are more versatile than non-hierarchical algorithms, 
or so-called partitional algorithms. For example, most partitional algorithms work 
well only on data sets containing isotropic clusters. Nevertheless, hierarchical 
clustering algorithms suffer from higher time and space complexities [10]. Therefore, a recent 
trend is to integrate hierarchical and partitional clustering algorithms, such as in 
BIRCH [19], CURE [5], and Chameleon [11]. In the kernel of these algorithms, a 
hierarchical clustering algorithm can be invoked to derive a dendrogram and to improve 
clustering quality. Due to this trend, it is expected that hierarchical clustering 
algorithms will continue to play an important role in applications that require a 
dendrogram. Furthermore, clustering quality becomes the prevailing concern in comparing 
various hierarchical clustering algorithms. 

This paper discusses the clustering quality and complexities of the hierarchical data 
clustering algorithm based on gravity theory in physics. The gravity theory based 
clustering algorithm simulates how the given N nodes in a K-dimensional continuous 
vector space will cluster due to gravity force, provided that each node is associated 
with a mass. The idea of exploiting gravity theory in data clustering was first 
proposed by W. E. Wright in 1977 [16]. In the article, Wright discussed several factors 
that may impact clustering quality. Nevertheless, one crucial factor that was not 
addressed in Wright's article is the order of the distance term in the denominator of the 
gravity force formula. As we know, the order of the distance term is 2 for the gravity 
force. However, there are natural forces whose magnitude of influence 
decreases much more rapidly as distance increases. One such force is the strong force in 
atomic nuclei. This observation inspired us to investigate the effect of the order of the 
distance term. In this paper, we still use the term "gravity force", even though we 
employ various orders of the distance term in the simulation model. 
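
The role of the order of the distance term can be illustrated with a generalized force magnitude. The sketch below is only an illustration: the function name, the parameter `order` and the constant `g` (set to 1) are our assumptions, and the paper's actual simulation model is described in later sections.

```python
def force_magnitude(m1, m2, distance, order=2, g=1.0):
    """Attraction between two masses with a distance term of the given order.

    order = 2 corresponds to ordinary gravity; larger orders decay faster with distance.
    """
    return g * m1 * m2 / distance ** order

for order in (2, 3, 5):
    near = force_magnitude(1.0, 1.0, 1.0, order)
    far = force_magnitude(1.0, 1.0, 4.0, order)
    print(f"order={order}: force at d=1 is {near}, at d=4 is {far}")
```

With a higher order, distant nodes contribute much less than nearby ones, so local structure dominates the simulated motion.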

The experiments conducted in this study show that the order of the distance term 
does have a significant impact on clustering quality. In particular, with a high order 
of the distance term, the gravity-based clustering algorithm neither has a bias towards 
spherical clusters nor suffers from the well-known chaining effect [10, 17]. Figure 1 
exemplifies how bias towards spherical clusters impacts clustering quality. In Fig. 1, 
the data points at the two ends of the two dense regions are clustered together. As will be 
shown in this paper, except for the single-link algorithm, all the conventional hierarchical 
clustering algorithms studied in this paper as well as the gravity-based clustering 
algorithm with a low order of the distance term have a bias toward spherical clusters. 
On the other hand, the single-link algorithm suffers from the well-known chaining effect. 
Since bias towards spherical clusters and the chaining effect are two common 
problems with respect to clustering quality, avoiding both implies that high clustering 
quality is achieved. 




Fig. 1. An example of how a clustering algorithm with bias towards spherical clusters suffers 
poor clustering quality. The data points in the two clusters are marked by 0 and 1, respectively. 

As Wright did not address the time and space complexities of the gravity-based 
clustering algorithm, we conduct a detailed analysis in this paper. Table 1 compares the 
time complexity and space complexity of the gravity-based clustering algorithm 
and the most well-known hierarchical clustering algorithms [2, 10] that work in 
spaces of any dimension. The hierarchical clustering algorithms that work 
only in low-dimensional spaces [1, 12] are not included for comparison. The time 
and space complexities of the gravity-based algorithm reported in Table 1 are based 
on the simulation model employed in this paper, which is slightly different from the 
model employed in Wright's paper. Though Wright did not analyze the time and 
space complexities of his algorithm, our simulation results show that Wright's 
simulation model has the same orders of complexity as the simulation model employed in 
this paper. As Table 1 reveals, the gravity-based clustering algorithm enjoys either 
lower time complexity or lower space complexity when compared with the most 
well-known hierarchical clustering algorithms except single-link. 

In the remainder of this paper, Sect. 2 elaborates how the gravity-based 
clustering algorithm works. Section 3 analyzes its time complexity and space complexity. 
Section 4 reports the experiments conducted to compare the gravity-based algorithm 
with the most well-known hierarchical clustering algorithms. Finally, concluding 
remarks are presented in Sect. 5. 



Table 1. Time and space complexities of the gravity-based clustering algorithm and the most 
well-known hierarchical clustering algorithms. 

Clustering Algorithm               Time complexity     Space complexity
The gravity-based algorithm        O(N^2)              O(N)
Single-Link [2]                    O(N^2)              O(N)
Complete-Link [10, 13]             O(N^2 log N)        O(N^2)
Centroid [2], Group Average [2]    O(N^2 log^2 N)      O(N)



2 The Gravity-Based Clustering Algorithm 

The simulation model that the gravity-based data clustering algorithm assumes is an 
analogy of how a number of water drops move and interact with each other in the 
cabin of a spacecraft. The main difference between the simulation model employed 
in this paper and that employed in Wright's paper is the order of the distance term in 
the denominator of the gravity force formula. This paper studies how different orders 
of the distance term impact clustering quality. Another difference is that the effect of 
air resistance is taken into account in this paper for guaranteeing termination of the 
algorithm, which was not addressed in Wright's paper. 

Now, let us elaborate the simulation model employed in this paper. Due to the 
gravity force, the water drops in the cabin of a spacecraft will move toward each 
other. When these water drops move, they will also experience resistance due to the 
air in the cabin. Whenever two water drops hit, which means that the distance
between these two drops is less than the sum of their radii, they merge to form
one new and larger water drop. In the simulation model, the merge of water drops
corresponds to forming a new, one-level higher cluster that contains two existing 
clusters. The air resistance is intentionally included in the simulation model in order 
to guarantee that all these water drops eventually merge into one big drop regardless 
of how these water drops spread in the space initially. Before examining the details 
of the simulation algorithm, let us first discuss some important observations based on 
our physical knowledge. 



Observation 1 : As time elapses, the system can not continue to stay in a state in which 
there are two or more isolated water drops and all these water drops stand still. 

Reason: 

The system may enter such a state but will leave that state immediately due to gravity 
forces among these water drops. 



Observation 2: As long as the system still contains two or more isolated water drops, the 
lumped sum of the dynamic energy and potential energy in the system will continue to 
decrease as time elapses. 

Reason: 

Due to Observation 1, if the system still contains two or more isolated water drops,
then these water drops can not all stand still indefinitely. As some water drops move, the
dynamic energy will gradually dissipate due to air resistance. The dissipated dynamic
energy is turned into another form of energy, namely heat. Furthermore, as the
dynamic energy in the system continues to dissipate, the potential energy in the system
will gradually convert to dynamic energy. Since the dissipation of dynamic energy never
stops as long as there are two or more isolated water drops in the system, the
lumped sum of dynamic energy and potential energy in the system will continue to
decrease as time elapses.



Observation 3: Regardless of how the water drops spread in the space initially, all water 
drops will eventually merge into one big drop. 

Reason: 



Assume that there is an initial spreading of water drops such that the system never
reaches a state in which all water drops merge into one big water drop. Let $ENG(t)$
denote the lumped sum of the potential energy and dynamic energy in the system.
According to Observation 2, $ENG(t)$ is a monotonically decreasing function as long as
there are two or more isolated water drops in the system. Since $ENG(t) \ge 0$ at any
time, there is a number $a \ge 0$ such that $\lim_{t \to \infty} ENG(t) = a$.
$\lim_{t \to \infty} ENG(t) = a$ implies

$$\lim_{t \to \infty} \frac{dENG(t)}{dt} = 0 .$$

Because the air resistance force experienced by a moving water drop is proportional to
the square of its moving velocity, $\lim_{t \to \infty} \frac{dENG(t)}{dt} = 0$ implies
that the velocities of all water drops will approach 0 as time elapses. However, just
as in the reason for Observation 1 above, the velocities of water drops can not all
approach 0 as time elapses, because the gravity forces will accelerate them. Therefore,
a contradiction would occur if our assumption held. Hence, the system will eventually
reach a state in which all water drops merge into one big drop.



The physical observations discussed above imply that the gravity-based data clustering
algorithm, which simulates the physical system discussed above, will eventually
terminate. Following is a formal description of the simulation model.



1. Each of the N initial nodes in the k-dimensional space is associated
   with a mass $M_0$.

2. There are two forces applied to the nodes in the system. The first one
   is gravity force and the second one is air resistance.

3. The gravity force $F_g$ applied to two nodes apart by a distance r is
   equal to:

   $$F_g = \frac{C_g \times M_1 \times M_2}{r^k}$$

   where $C_g$ is a constant, $M_1$ and $M_2$ are the masses of these two
   nodes, and k is a positive integer.

4. The nodes will suffer air resistance when they move. The air resistance
   force $F_r$ that a moving node experiences is equal to

   $$F_r = C_r \times v^2$$

   where $C_r$ is a constant and v is the velocity of the node.

5. At any time, if two nodes are apart by a distance less than the sum of
   their radii, then these two nodes will merge to form a new node with
   lumped mass. The radius of the new node is determined by the mass
   of the node and a given constant, which denotes the density of the
   material. The momentum of the new node is equal to the sum of the
   momentums of the original nodes. The merge of two nodes corresponds to
   forming a new, one-level higher cluster that contains the two existing
   clusters represented by the two original nodes.



Figure 2 shows the pseudo-code of the gravity-based clustering algorithm. Basi- 
cally, the algorithm iteratively simulates the movement of each node during a time 
interval T and checks for possible merges. The algorithm terminates when all nodes 
merge into one big node. 
W: the set containing all disjoint nodes. At the beginning, W
   contains all initial nodes.

Repeat

    For every w_i in W {
        calculate the acceleration of w_i based on the gravity
            forces applied on w_i by other nodes in W and the mass of w_i;
        calculate the new velocity of w_i based on its current
            velocity and acceleration;
        calculate the new position of w_i based on its current
            velocity;
    };

    For every pair of nodes w_i, w_j in W {
        if (w_i and w_j hit during the given time interval T) {
            create a new cluster containing the clusters represented
                by w_i and w_j;
            merge w_i and w_j to form a new node with lumped
                masses and merged momentum;
        };
    };

Until (W contains only one node);

Fig. 2. Pseudo-code of the gravity-based data clustering algorithm
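
The loop in Fig. 2 can be sketched in Python as follows. This is only a minimal illustration under several assumptions that are not taken from the paper: the function names `gravity_cluster` and `disk_radius` are hypothetical, nodes live in two dimensions and are treated as disks, a merged node is placed at the centre of mass of its parts, a "hit" is detected as an overlap at the end of a step (which requires a small time step), and the defaults `k=2` and `dt=0.05` are chosen so that the toy example runs quickly.

```python
import math
from itertools import combinations


def disk_radius(mass, density=1.0):
    # Assumption: nodes are 2-D disks, so area = mass / density.
    return math.sqrt(mass / (density * math.pi))


def gravity_cluster(points, c_g=30.0, c_r=0.01, k=2, dt=0.05,
                    density=1.0, max_steps=100_000):
    """Return the merge history [(child_a, child_b, parent), ...]."""
    nodes = [{"id": i, "m": 1.0, "x": [float(c) for c in p], "v": [0.0, 0.0]}
             for i, p in enumerate(points)]
    merges, next_id = [], len(nodes)
    for _ in range(max_steps):
        # 1. Merge overlapping nodes, one pair at a time.
        merged = True
        while merged:
            merged = False
            for a, b in combinations(nodes, 2):
                reach = disk_radius(a["m"], density) + disk_radius(b["m"], density)
                if math.dist(a["x"], b["x"]) < reach:
                    m = a["m"] + b["m"]
                    new = {"id": next_id, "m": m,
                           # Assumption: the new node sits at the centre of mass.
                           "x": [(a["m"] * a["x"][d] + b["m"] * b["x"][d]) / m
                                 for d in range(2)],
                           # Momentum is conserved: p_new = p_a + p_b.
                           "v": [(a["m"] * a["v"][d] + b["m"] * b["v"][d]) / m
                                 for d in range(2)]}
                    merges.append((a["id"], b["id"], next_id))
                    nodes = [n for n in nodes if n is not a and n is not b] + [new]
                    next_id += 1
                    merged = True
                    break
        if len(nodes) == 1:
            return merges
        # 2. Accumulate gravitational acceleration over all pairs (quadratic cost).
        acc = [[0.0, 0.0] for _ in nodes]
        for i, j in combinations(range(len(nodes)), 2):
            a, b = nodes[i], nodes[j]
            delta = [b["x"][d] - a["x"][d] for d in range(2)]
            r = math.hypot(*delta)
            force = c_g * a["m"] * b["m"] / r ** k
            for d in range(2):
                acc[i][d] += force * delta[d] / (r * a["m"])
                acc[j][d] -= force * delta[d] / (r * b["m"])
        # 3. Apply air resistance (|F_r| = c_r * v^2, opposing v), then move.
        for n, g in zip(nodes, acc):
            speed = math.hypot(*n["v"])
            for d in range(2):
                drag = c_r * speed * n["v"][d] / n["m"]
                n["v"][d] += (g[d] - drag) * dt
                n["x"][d] += n["v"][d] * dt
    raise RuntimeError("no convergence within max_steps")


# Two close points merge first; the distant point joins last.
print(gravity_cluster([(0, 0), (2, 0), (10, 0)]))
```

The returned list is a flat encoding of the dendrogram: each tuple names the two clusters that were merged and the identifier given to the new cluster.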



3 Time and Space Complexity of the Gravity-Based Algorithm 

We will employ a probability model to prove that the time complexity of the
gravity-based data clustering algorithm is $O(N^2)$, where N is the number of nodes
initially. The proof is based on the observation that the expected number of disjoint
nodes remaining after each iteration of simulation decreases exponentially.

Assume that these N initial nodes are randomly spread in a k-dimensional Euclidean
space bounded by $[X_{1l}, X_{1h}], [X_{2l}, X_{2h}], \dots, [X_{kl}, X_{kh}]$, where
$X_{jl}$ and $X_{jh}$ are the lower bound and upper bound in the j-th dimension,
respectively. Depending on how these N nodes initially spread in the bounded space,
the number of disjoint nodes remaining after the i-th iteration of the gravity-based
data clustering algorithm may differ. Let $N_i$ denote the random variable that
corresponds to the number of disjoint nodes after the i-th iteration of the
gravity-based algorithm. It has been proved that all the nodes will eventually merge
into one big node. Therefore, if the number of initial nodes N and the boundary in the
k-dimensional space are determined, then there exists an integer S such that all nodes
merge into one big node after S iterations of the gravity-based algorithm regardless
of how these N nodes spread in the bounded space initially. Let $E[N_i]$ denote the
expected value of random variable $N_i$ and

$$q = \mathrm{Maximum}\left(\frac{E[N_{i+1}]}{E[N_i]}\right), \quad \text{where } 0 \le i \le S-1,\; E[N_0] = N \text{ and } E[N_S] = 1 .$$

Then, we have $0 < q < 1$ and

$$E[N_i] \le N \times q^i \qquad (1)$$



One important attribute of q that we will exploit later is that q decreases as the
number of initial nodes N increases, as long as the boundary in the k-dimensional
space in which the initial nodes spread does not change with the value of N. As the
number of nodes in a fixed-size space increases, the probability that two nodes hit
during a time interval increases. As a result, q decreases as N increases.

To determine the time complexity of the gravity-based data clustering algorithm,
we need to determine the number of operations performed in each iteration of the
algorithm. In each iteration of the algorithm, we need to compute the distance
between each pair of disjoint nodes and check for possible merges. The complexity of
carrying out these two operations is quadratic in the number of disjoint nodes. The
complexities of all other operations executed in one iteration are of lower order and
thus can be ignored in determining time complexity. Therefore, the time complexity of
the gravity-based data clustering algorithm is equal to

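$$\sum_{i=0}^{S-1} E[C \times N_i^2], \quad \text{where } C \text{ is a constant.}$$

$$\sum_{i=0}^{S-1} E[C \times N_i^2] = C \times \sum_{i=0}^{S-1} E[N_i^2] = C \times \sum_{i=0}^{S-1} \sum_{l=1}^{N} \mathrm{Probability}(N_i = l) \times l^2$$

$$\le C \times \sum_{i=0}^{S-1} \sum_{l=1}^{N} \mathrm{Probability}(N_i = l) \times l \times N = C \times N \times \sum_{i=0}^{S-1} \sum_{l=1}^{N} \mathrm{Probability}(N_i = l) \times l$$

$$= C \times N \times \sum_{i=0}^{S-1} E[N_i] \le C \times N \times \sum_{i=0}^{S-1} N \times q^i, \quad \text{according to (1) above}$$

$$= C \times N^2 \times \sum_{i=0}^{S-1} q^i = C \times \frac{1 - q^S}{1 - q} \times N^2 \qquad \Box$$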



As elaborated earlier, q decreases as N increases and $0 < q < 1$. Therefore, the term
$\frac{1 - q^S}{1 - q}$ decreases as N increases and the time complexity of the
gravity-based data clustering algorithm is $O(N^2)$. The space complexity of the
algorithm is $O(N)$, because the space complexity of the hierarchical dendrogram built
by the clustering algorithm is $O(N)$ and in each iteration we need to compute and
store the location, velocity, and acceleration of each disjoint node.
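
The last step of the derivation relies on the closed form of a geometric series. A short worked version follows; the value q = 1/2 is an assumption chosen purely for illustration and is not a measurement from the paper.

```latex
\sum_{i=0}^{S-1} q^{i} \;=\; \frac{1-q^{S}}{1-q} \;\le\; \frac{1}{1-q}
\qquad (0 < q < 1),
\qquad\text{so}\qquad
\sum_{i=0}^{S-1} E\!\left[C \times N_i^{2}\right] \;\le\; \frac{C}{1-q}\, N^{2}.

% Illustration with the assumed value q = 1/2:
\frac{C}{1-\tfrac{1}{2}}\, N^{2} \;=\; 2\,C\,N^{2}.
```

The bound no longer depends on the number of iterations S, which is why the total cost stays quadratic in N.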
4 Experimental Results

This section reports the experiments conducted to study how the gravity-based
clustering algorithm performs in practice. The first part of this section reports how
various algorithms perform with respect to clustering quality. The second part of this
section reports the execution times of the gravity-based algorithm when running on
real data sets. The clustering algorithms included in the clustering quality comparison
are as follows:

(1) the gravity-based clustering algorithm based on our simulation model with 
the order of the distance term set to 5; 

(2) the gravity-based clustering algorithm based on Wright's model with the or- 
der of the distance term set to 2; 

(3) four conventional hierarchical clustering algorithms: single-link[9, 10], com- 
plete-link[9, 10], group-average[9, 10], and centroid[9, 10]. 

Table 2 shows how the parameters in the gravity-based algorithms were set in these
experiments. In our experience, the order of the distance term has the dominant
effect on clustering quality. With the order of the distance term set to 5 or
higher, the gravity-based clustering algorithm neither has a bias towards spherical
clusters nor suffers the chaining effect. The settings of $C_g$, $M_0$, and T mainly
affect how rapidly the nodes merge over the iterations. Employing small values for
these parameters may result in more iterations in the simulation. Nevertheless, this
does not change the order of the time complexity of the algorithm. It only affects a
coefficient in the time complexity formula. The remaining two parameters, $C_r$ and
$D_m$, have essentially no effect on clustering quality or speed of convergence as
long as they are not set to extreme values. For example, the coefficient of air
resistance should not be set so high that the nodes can hardly move, and the material
density of the nodes should not be set so high that the nodes have virtually no volume
and can hardly hit each other.



Table 2. The parameter settings in the gravity-based algorithms.

(a) The simulation model employed in this paper

Parameter                                   Value
$C_g$ : Gravity force coefficient           30
$M_0$ : Initial mass of each node           1
$C_r$ : Air resistance coefficient          0.01
$D_m$ : Material density of the node        1
T : Time interval of each iteration         1

(b) Wright's simulation model

Parameter                                                       Value
q : the order of the mass term                                  0
$\delta$ : the distance that the node with maximum velocity
    moves in one iteration                                      1

Figures 3 to 5 show three experiments conducted to study the clustering quality of
different algorithms. For better visualization, these figures plot only the clusters
remaining before the last merge of clusters is executed. In these figures, different
clusters are plotted using different marks. The experimental results presented in
Fig. 3 show the effect caused by bias towards spherical clusters. As shown in Fig. 3,
except for the gravity-based algorithm with a high order of the distance term and the
single-link algorithm, all other algorithms have a bias towards spherical clusters and
generate clusters that contain data points from both separate dense regions. The
experimental results presented in Fig. 4 show the well-known chaining effect. In this
case, only the single-link algorithm suffers the chaining effect. As shown in Fig. 4b,
the cluster containing the data points marked by "O" extends to both spheres. Fig. 5
shows a data set designed to test how each algorithm handles both bias towards
spherical clusters and the chaining effect. As shown in Fig. 5, only the gravity-based
algorithm with a high order of the distance term neither has a bias towards spherical
clusters nor suffers the chaining effect.

We also use three 3-dimensional data sets to compare the clustering quality of 
various algorithms. Figure 6a depicts the shapes of the 3 data sets used in the experi- 
ments and the numbers of data points in these data sets, respectively. Figure 6b sum- 
marizes the clustering quality of various algorithms. Again, only the gravity-based 
algorithm with a high order of the distance term neither has a bias toward spherical 
clusters nor suffers the chaining effect. 




Fig. 3. Clustering results of the first experiment. Panels: (a) the simulation model in
this paper, (b) single-link, (c) complete-link, (d) group average, (e) centroid,
(f) Wright's simulation model

Fig. 4. Clustering results of the second experiment. Panels as in Fig. 3

Fig. 5. Clustering results of the third experiment. Panels as in Fig. 3

The experimental results above show that the gravity-based algorithm with a high 
order of the distance term is not biased towards spherical clusters. However, if the 
order of the distance term is low, then the situation may be different. We must resort 
to our physical intuition to explain this phenomenon. With a high order of the dis- 
tance term, the influence of the gravity force decays rapidly as distance increases. 
Therefore, data points separated by a channel feel virtually no influence from each 
other. 




(a) The 3 data sets used: Data set 1 (808 points), Data set 2 (1800 points),
Data set 3 (4175 points).

Algorithm              Data set 1              Data set 2              Data set 3
Single-link            Poor (due to the        Good                    Poor (due to the
                       chaining effect)                                chaining effect)
Complete-link          Good                    Poor (bias towards      Poor (bias towards
                                               spherical clusters)     spherical clusters)
Group average          Good                    Poor (bias towards      Poor (bias towards
                                               spherical clusters)     spherical clusters)
Centroid               Good                    Poor (bias towards      Poor (bias towards
                                               spherical clusters)     spherical clusters)
Wright's model         Good                    Poor (bias towards      Poor (bias towards
                                               spherical clusters)     spherical clusters)
The simulation         Good                    Good                    Good
model in this paper

(b) Summary of clustering quality of various algorithms

Fig. 6. The experiments conducted on 3-D data sets
As far as execution time is concerned, Fig. 7 shows how the execution time of the
gravity-based algorithm increases with the number of initial nodes to be clustered.
The experiment was conducted on a machine equipped with a 700-MHz Intel Pentium-III
CPU and 786 Mbytes of main memory, running the Microsoft Windows 2000 operating
system. The data set used is the Sequoia 2000 storage benchmark [15], which contains
62556 nodes in total. In this experiment, we randomly selected a subset of nodes from
the benchmark data set. The results in Fig. 7 confirm that the time complexity of the
gravity-based clustering algorithm is $O(N^2)$.

Fig. 7. Execution time of the gravity-based clustering algorithm versus the number of initial
nodes to be clustered. Horizontal axis: number of initial nodes (300, 600, 1200, 2400,
4800, 9600); vertical axis: log2 of the time in seconds



5 Conclusions 

This paper studies the clustering quality and complexities of a hierarchical data clus- 
tering algorithm based on gravity theory in physics. In particular, this paper studies 
how the order of the distance term in the denominator of the gravity force formula 
impacts clustering quality. The study reveals that with a high order of the distance 
term, the gravity-based clustering algorithm neither has a bias towards spherical clus- 
ters nor suffers the chaining effect. Since bias towards spherical clusters and the 
chaining effect are two major problems with respect to clustering quality, eliminating 
both implies that high clustering quality is achieved. As far as time complexity and 
space complexity are concerned, the gravity-based algorithm enjoys either lower time 
complexity or lower space complexity, when compared with the well-known hierar- 
chical data clustering algorithms except single-link. 

As discussed earlier, a recent trend in developing clustering algorithms is to
integrate hierarchical and partitional algorithms. Since the general properties of the
gravity-based algorithm with respect to clustering quality are similar to those of the
density-based partitional algorithms such as DBSCAN [3] and DENCLUE [8], it is of
interest to develop a hybrid algorithm that integrates the gravity-based algorithm and
the density-based algorithm. In the hybrid algorithm, the gravity-based algorithm is
invoked to derive the desired dendrogram. This is the follow-up work that we have
been investigating.
Abstract. Most search engines return a lot of unwanted information. A
more thorough filtering process can be performed on this information to
sort out the relevant documents. A new method called Frequency Domain
Scoring (FDS), which is based on the Fourier Transform, is proposed.
FDS performs the filtering by examining the locality of the keywords
throughout the documents. It is examined and compared to the well
known techniques Latent Semantic Indexing (LSI) and the cosine measure.
We found that FDS gives better estimates of how relevant a document
is to the query. The other two methods (cosine measure, LSI) do not
perform as well, mainly because they need a wider variety of documents
to determine the topic.



1 Introduction 

The ability to classify a document automatically and accurately has become an
important issue in the past few years due to the exponential growth of the
Internet and the availability of on-line information. Many methods such as topic
identification have been tried by search engine creators and abused by web page
writers who try their best to mislead the search engine so that their page appears
at the top of the search results.

At present, the only way to find any useful information on the Internet is to 
use a search engine and manually sort through all of the results returned. 

There has been a huge interest in relevant document retrieval, and several 
people have developed methods to allow the user to obtain the right information. 
For example, Spertus cm uses different types of connectivity and spatial locality 
to detect relevant pages; Mladenic jOj examines the pages previously visited by 
the user and uses these examples to learn what to retrieve in the future; Carriere 
et al. |2] calculates the score of a page based on how many relevant pages point 
to it through links; Ngu et al. [Z| recommend that rather than search engines 
maintaining information about the entire Internet, each site should produce the 
information needed to produce better search results; Howe et al. jS| queries the
existing range of search engines to obtain the best results from the collection of
pages returned.

The method proposed here will examine the documents searched and try to 
find those with the topic requested by giving a score based on their spectrum, 
so that the user obtains the documents he/she truly requires. 

This paper proceeds as follows. Sect. 2 contains a description of the document
filtering process. Sect. 3 gives the problem formulation and explains how
the current methods of filtering are not using all of the document information.
Sect. 4 introduces the FDS method and explains how to calculate the score.
Sect. 5 contains a short discussion on the computational complexity of the FDS
method. Sect. 6 shows results from two separate experiments (one based on the
TREC database and the other on three actual Internet searches) and gives an
analysis of the results from both experiments. Finally, the conclusion is given
in Sect. 7.



2 Document Filtering 

The objective of document filtering is to extract all of the relevant documents 
related to a certain topic from a set of documents of unknown topics. Examples of 
document filtering methods include the cosine measure, latent semantic indexing 
(LSI) and the new superior method Frequency Domain Scoring (FDS) introduced 
in this paper. 

The methods used in this paper perform the filtering on a document set con- 
sidered relevant by a selection of Internet search engines. Therefore the document 
filtering will be performed on a local machine rather than a remote machine. 

Even though the trials were performed on the results of a few Internet search 
engines, these methods could easily be incorporated in the search engine itself. 



3 Problem Formulation 

Let $A(t)$ be the entire collection of documents on the Internet at time t, where
$A(t) = \{d_0, d_1, \dots, d_N\}$ and $0 < N < \infty$ is the number of documents on
the Internet at time t. Each document $d_n$ is represented as the tuple
$\{i_n, \mathcal{L}_n\}$ where $i_n$ is the information contained in document n and
$\mathcal{L}_n \subset A(t)$ is the set of documents which can be accessed through
$d_n$ via hypertext links.

There exists a set $\mathcal{R}_T \subset A(t)$, where $\mathcal{R}_T$ is the set of
all relevant documents to topic $T \in \mathcal{T}$ (the topic space). We want a
function $S : \mathcal{T} \to A(t)$ such that $S(T) = \mathcal{R}_T$. The current
non-existence of the function S is the reason why search engines (which try to
approximate S) do not always return correct results. Internet search engines will give
us a subset of $\mathcal{R}_T$ and a set of irrelevant information (related to the
search topic T). This can be represented as $R_T \cup E_T$ where
$R_T \subset \mathcal{R}_T$ and $E_T \subset A(t) \setminus \mathcal{R}_T$.

Rather than sifting through the entire collection of documents on the Internet
($A(t)$), we will use the results from several search engines ($R_T \cup E_T$) and try
to extract the set of documents $R_T$ or remove the unwanted (irrelevant) $E_T$. This
is shown in Fig. 1. This approach will lead to a good approximation of S.

Fig. 1. This figure gives a visual example of the problem at hand. It can be seen that
the set of documents returned by search engines contains irrelevant information. The
problem is how to extract the useful information from this set

Current searching methods use the following approach. By defining a function
$g : A(t) \to \mathbb{R}^M$ we are able to represent each document $d_n \in A(t)$ as an
M-dimensional vector in real space. The mapping is performed by treating each word in
the document as a single dimension; the number of times that word appears in the
document will be its value. Therefore

$$g(d_n) = \delta_n = [\, c(d_n, w_1) \;\; c(d_n, w_2) \;\; \dots \;\; c(d_n, w_M) \,]^T \qquad (1)$$

is a document vector, where $c(d_n, w_m) \in \mathbb{N}$ is the frequency count of word
$w_m$, all $w_m$ are unique and the dictionary contains M words. This document vector
is then used to give the relevance for the document.

The above mapping of the document space into the M dimensional real space 
causes all of the important spatial information of the documents (the order of 
the words) to be lost. This also applies to the topic spatial information. 
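
The loss of word order can be seen in a small sketch of the count-vector mapping. The helper name `document_vector` and the three-word dictionary are hypothetical and serve only as an illustration; they are not taken from the paper.

```python
from collections import Counter


def document_vector(text, dictionary):
    """Map a document to its word-count vector over a fixed dictionary."""
    counts = Counter(text.lower().split())
    return [counts[word] for word in dictionary]


dictionary = ["aluminium", "bat", "cricket"]

# Same words, different order: the two vectors are identical.
vec_a = document_vector("aluminium cricket bat", dictionary)
vec_b = document_vector("bat cricket aluminium", dictionary)
print(vec_a, vec_b, vec_a == vec_b)
```

Any method that scores documents from these vectors alone cannot tell the two documents apart.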

4 Frequency Domain Scoring 

When a few key words are entered as a search query, they usually belong to one
topic. For example, if the words "aluminium cricket bat" are entered, we would
expect to get documents on aluminium cricket bats. The classification methods
listed so far would also return documents on cricket bats and on aluminium.

To make use of the spatial information of the document, the vectors used
here represent the positions of the search terms throughout the documents.
Documents which have keywords appearing periodically and which contain the
keywords together are given a higher relevance than the documents that have
the keywords apart. To analyse the relative positions, the vectors are mapped
into the frequency domain.

Just as the discrete Fourier transform can map discrete time intervals into
the frequency domain, so too can it map discrete word spatial intervals into the
frequency domain, as shown in Fig. 2.




Fig. 2. This picture gives a visual example of how the Frequency Domain Scoring is 
performed. As shown, for each word in the search term, the document is split into 
equally sized bins. The value of each bin is the frequency of the word within that 
section of text. The DFT is performed and the magnitude and phase results are used 
to give the document score 



By counting the number of appearances of a word in a document, we can 
treat the word position as the position in time. Performing the DFT allows us to 
observe the spectrum of the word. By splitting the spectrum into the magnitude 
and phase, we can see the power and delay of the word at certain frequencies. 

By treating each word as a discrete time interval, we get a string of ones and 
zeros. To be more efficient, sequences of words can be clustered into bins (eg. 
the first fifty words in bin 1, the second fifty in bin 2, ...). This reduces the size 
of the input to the DFT and also gives larger counts than one in each bin. 
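
The binning step can be sketched as follows. The function name `bin_counts` is hypothetical, and the choice of rounding the bin width up so that every word falls into one of the bins is an assumption made for this illustration.

```python
import math


def bin_counts(words, term, num_bins):
    """Count occurrences of `term` in each of `num_bins` equal-width bins."""
    # Assumption: the bin width is rounded up so that all words are covered.
    per_bin = max(1, math.ceil(len(words) / num_bins))
    counts = [0] * num_bins
    for position, word in enumerate(words):
        if word == term:
            counts[position // per_bin] += 1
    return counts


words = "bat x x x x x bat bat".split()
print(bin_counts(words, "bat", 4))  # [1, 0, 0, 2]
```

The resulting list is the word signal that is passed to the DFT; a signal of four bins is used here instead of eight single-word positions.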

Once the DFT is performed, the word spectrum shows the frequency components
of which the word signal is made up. Each frequency component is a
complex number of the form $H_f e^{i\phi_f}$, where $H_f \in \mathbb{R}$ represents
the power of the frequency component f, and $\phi_f \in \mathbb{R}$ is the phase
shift of f.

Terms made from several words are normally the topic of the document when
the words appear close together and periodically. Therefore, if frequency f has a
large magnitude ($H_f$) in a document for all of the words from the set T, and
the phases of each word from T are similar, then it is most likely that T is
a subset of the topic.

In mathematical terms, given a query T where

$$T = \{w_1, w_2, \dots, w_M\} \qquad (2)$$

we can define a counting function $cf : A(t) \times T \to \mathbb{R}^B$

$$cf(d_n, w_m) = [\, c_1(d_n, w_m)/\beta \;\; c_2(d_n, w_m)/\beta \;\; \dots \;\; c_B(d_n, w_m)/\beta \,] \qquad (3)$$

where $c_b(d_n, w_m)$ is the count of word $w_m$ in bin $1 \le b \le B$ of document
$d_n$ and $\beta$ is the number of words per bin. The spectrum of cf can be found
using the Fourier transform.



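$$\zeta(d_n, w_m) = \mathcal{F}\left[ cf(d_n, w_m) \right] \qquad (4)$$

$$= \left[\, H_1^{(n,m)} e^{i\phi_1^{(n,m)}} \;\; H_2^{(n,m)} e^{i\phi_2^{(n,m)}} \;\; \dots \;\; H_B^{(n,m)} e^{i\phi_B^{(n,m)}} \,\right] \qquad (5)$$

where $\mathcal{F}$ is the Fourier Transform, and $H_b^{(n,m)} \in \mathbb{R}$ and
$\phi_b^{(n,m)} \in \mathbb{R}$ are the magnitude and phase of the bth frequency bin
from the nth document and mth word respectively. If

$$\mathrm{var}\left( \phi_b^{(n,m)} \right) < \epsilon \qquad (6)$$

and

$$H_b^{(n,m)} > H \quad \forall\, m \qquad (7)$$

where $\epsilon$ is a small value and H is a large value, then it is more likely that
$d_n \in \mathcal{R}_T$. Therefore we want to give a higher relevance score to those
documents with a higher magnitude and lower variance in phase of each frequency
component.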
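
The magnitude and phase of each frequency component can be obtained from a binned word signal as in the sketch below. It evaluates the DFT definition directly, for clarity, rather than with the radix-2 FFT that the paper reports using; the function name `spectrum` is hypothetical.

```python
import cmath


def spectrum(signal):
    """Return [(magnitude, phase), ...] for each DFT component of `signal`."""
    n = len(signal)
    components = []
    for f in range(n):
        # Direct evaluation of the DFT sum for frequency index f.
        z = sum(x * cmath.exp(-2j * cmath.pi * f * t / n)
                for t, x in enumerate(signal))
        components.append((abs(z), cmath.phase(z)))
    return components


# Word counts in four bins, e.g. produced by a binning step.
for magnitude, phase in spectrum([2, 0, 0, 1]):
    print(round(magnitude, 3), round(phase, 3))
```

The first component (f = 0) is the DC term: its magnitude is simply the total count of the word, which is the only information a count-based measure uses.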

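The measure of variance does not take into account the circular nature of the
phase (modulo $2\pi$). To overcome this problem, a measure of precision (rather
than variance, not to be confused with the precision measure of document
retrieval) was used. If

$$\bar{C}_b^{(n)} = \frac{1}{M} \sum_{m=1}^{M} \cos\left( \phi_b^{(n,m)} \right)
\quad \text{and} \quad
\bar{S}_b^{(n)} = \frac{1}{M} \sum_{m=1}^{M} \sin\left( \phi_b^{(n,m)} \right)$$

then the precision ($\bar{r}$) and mean ($\bar{\phi}$) are defined by

$$\bar{C}_b^{(n)} = \bar{r}_b^{(n)} \cos \bar{\phi}_b^{(n)} \qquad (8)$$

and

$$\bar{S}_b^{(n)} = \bar{r}_b^{(n)} \sin \bar{\phi}_b^{(n)} \qquad (9)$$

so

$$\bar{r}_b^{(n)} = \sqrt{ \left( \bar{C}_b^{(n)} \right)^2 + \left( \bar{S}_b^{(n)} \right)^2 } \qquad (10)$$

The precision has a range of [0, 1], where 1 is maximum precision. The notion of
precision of the phases can be seen in the visual example given in Fig. 3.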
Fig. 3. This picture gives a visual example of how the precision function works. If each
phase is considered as a unit vector (thin vectors) centered at zero, the precision will
be the average of these vectors (thick vector)
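
The precision of Eq. (10) is the length of the average of the unit vectors in Fig. 3, and can be sketched in a few lines. The function name `phase_precision` is hypothetical; the example values are chosen only to show why an ordinary variance would be misleading for phases near the wrap-around point.

```python
import math


def phase_precision(phases):
    """Length of the mean unit vector of the given phase angles (radians)."""
    mean_cos = sum(math.cos(p) for p in phases) / len(phases)
    mean_sin = sum(math.sin(p) for p in phases) / len(phases)
    return math.hypot(mean_cos, mean_sin)


print(phase_precision([0.5, 0.5, 0.5]))                   # identical phases
print(phase_precision([0.0, math.pi]))                    # opposite phases
print(phase_precision([math.pi - 0.1, -math.pi + 0.1]))   # close across the wrap
```

The last pair of phases differs by only 0.2 radians on the circle, so its precision is close to 1 even though the raw numbers are almost $2\pi$ apart.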



This gives a score function of:

$$S_{FDS}(d_n, T) = \sum_{b=1}^{B} \bar{r}_b^{(n)} \log\left( \Lambda + \frac{\sum_{m=1}^{M} H_b^{(n,m)}}{M} \right) \qquad (11)$$

where

$$\Lambda = \begin{cases} 0 & \text{if } c(d_n, w_m) = 0 \text{ for some } m \\ Q & \text{if } c(d_n, w_m) \ne 0 \;\; \forall\, m \end{cases} \qquad (12)$$

where Q is a constant positive real number. Note that if the AC components
are ignored, FDS will perform the same as the cosine measure. This is because
$\bar{r}_b^{(n)} = 1$ for any n when b = 1. Therefore the cosine measure can be viewed
as a special case of the FDS measure. By examining the spectrum of the words (and
not just the count) we are able to obtain a better understanding of the content
of the document.

The $\Lambda$ value was inserted to give a higher ranking to documents which contain
all of the words in the query. This can be adjusted to suit the user's preference.



5 Computational Complexity 

Choosing an information retrieval method just because it gives accurate results 
is not sufficient. The methods have to be practical. This is why computational 
complexity comes into play when deciding which method is best. In most cases 
there is a trade off between speed and accuracy, where the level chosen should 
suit the user. 

All methods suggested require scanning through the documents, word by
word. This stage only depends on the length of the documents and has been
omitted from the analysis since it is common to all methods. This process can
also be pre-computed and stored for future classifications.

The FDS method performs the Fourier transform on each word in the search
query. Since the FDS method depends only on the document being processed,
the spectral values have to be calculated only once and can then be stored for later
use.

To speed up the process, the Radix-2 FFT was used. The only drawback 
to using this method is that the bin size must be a power of two (B = 2^p, p ∈ N). 
This drops the computational complexity from O(N^2) (direct DFT calculation) 
to O((N/2) log_2 N) (Radix-2). 

This implies that for M unique words in the search query, the time taken to 
calculate the score of one document will be in the order of O((MN/2) log_2 N). 
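
The difference between the direct DFT and the Radix-2 FFT can be illustrated with a minimal sketch. It is not the implementation used in the experiments; the function names and the 16-bin word signal are hypothetical, chosen only to show the power-of-two restriction and that both routes give the same spectrum.

```python
import cmath


def dft_direct(signal):
    """Direct DFT: O(N^2) complex multiplications."""
    n = len(signal)
    return [
        sum(signal[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]


def fft_radix2(signal):
    """Recursive radix-2 FFT; the length must be a power of two."""
    n = len(signal)
    if n == 0 or n & (n - 1):
        raise ValueError("signal length must be a power of two")
    if n == 1:
        return [complex(signal[0])]
    even = fft_radix2(signal[0::2])  # transform of even-indexed samples
    odd = fft_radix2(signal[1::2])   # transform of odd-indexed samples
    spectrum = [0j] * n
    for k in range(n // 2):
        twiddled = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        spectrum[k] = even[k] + twiddled
        spectrum[k + n // 2] = even[k] - twiddled
    return spectrum


# Hypothetical word signal: occurrences of one query term in each of 16 bins.
word_signal = [0, 2, 0, 0, 1, 0, 0, 0, 0, 3, 0, 0, 0, 0, 1, 0]
spectrum = fft_radix2(word_signal)
dc_component = spectrum[0].real  # the DC component is the total term count
```

In this sketch the DC component is simply the total number of occurrences of the term, which is the quantity a count-based measure works with; the remaining components carry the positional information.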

6 Results 

We conducted two large experiments, one using the TREC database m and the 
other using a database of documents selected from the results of Internet search 
engines. 



6.1 Preliminaries 

To find how effective the FDS method was we compared the results given with 
trials using the cosine measure ^2] and Latent Semantic Indexing (LSI) p. Pre- 
processing was performed on the data to make computation easier and give fairer 
results. This consisted of removing stop words, stemming, and using log-entropy 
normalisation (found in j^). After performing several trials using different values 
of Q, it was chosen to be 1. The bin size was set to 16 to give a good tradeoff 
between accuracy and disk space. The document filtering methods FDS, cosine 
measure, and LSI are then performed to evaluate their relative merits. 

The results (in Tables 2 and 3) were evaluated by examining two accuracy 
measures of precision, which, in the information retrieval sense, is the 
proportion of retrieved documents that are relevant. The two measures 
are: 

Average Precision. This value is best explained by observing Table 1. 
R-Precision is the precision after the first R documents, where R is the number 
of relevant documents for that query. 
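
Both measures can be computed directly from the ranks at which the relevant documents appear. The following is a small sketch under that assumption; the function names and the example ranks are hypothetical and are not taken from the experiments.

```python
def average_precision(relevant_ranks):
    """Mean of the sub-precisions i / r_i over the relevant documents.

    relevant_ranks holds the 1-based rank of every relevant document.
    """
    if not relevant_ranks:
        raise ValueError("at least one relevant document is required")
    ordered = sorted(relevant_ranks)
    return sum((i + 1) / rank for i, rank in enumerate(ordered)) / len(ordered)


def r_precision(relevant_ranks):
    """Precision after the first R documents, with R = number of relevant documents."""
    if not relevant_ranks:
        raise ValueError("at least one relevant document is required")
    r = len(relevant_ranks)
    return sum(1 for rank in relevant_ranks if rank <= r) / r


# Hypothetical ranking: the three relevant documents were placed 1st, 3rd and 6th.
example_ranks = [1, 3, 6]
ap = average_precision(example_ranks)  # (1/1 + 2/3 + 3/6) / 3
rp = r_precision(example_ranks)        # two of the top three are relevant
```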

The time taken to perform the ranking was very similar for each of the 
methods. 

6.2 Methods Compared 

Cosine Measure. The score for each document was calculated by finding the 
normalised dot product of the document vector (shown in equation d and the 
query vector. This method will be referred to as COS throughout this document. 

Table 1. The average precision is calculated by first calculating the sub-precisions of 
each relevant document (shown on each line of the table). Sub-precision is found by 
including only the documents ranked higher than the selected relevant document 

Relevant document number    Rank    Precision 
1                           r_1     1/r_1 
2                           r_2     2/r_2 
3                           r_3     3/r_3 
n                           r_n     n/r_n 

Average Precision: (\sum_{i=1}^{n} i/r_i) / n 
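
As a point of reference, the normalised dot product used by the COS method can be sketched as follows. The vectors here are hypothetical term-weight vectors over a shared vocabulary; the weighting applied in the experiments is not reproduced.

```python
import math


def cosine_score(doc_vector, query_vector):
    """Normalised dot product of two equal-length term-weight vectors."""
    if len(doc_vector) != len(query_vector):
        raise ValueError("vectors must have the same length")
    dot = sum(d * q for d, q in zip(doc_vector, query_vector))
    norm = math.sqrt(sum(d * d for d in doc_vector)) * math.sqrt(
        sum(q * q for q in query_vector)
    )
    # A document or query with no weighted terms scores zero.
    return dot / norm if norm else 0.0


# Hypothetical weights for a three-term vocabulary.
score = cosine_score([2.0, 0.0, 1.0], [1.0, 0.0, 1.0])
```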



Latent Semantic Indexing. Latent Semantic Indexing produces document 
vectors of smaller dimension than those of the cosine measure, but closest to 
them in the least squares sense, via Singular Value Decomposition (SVD) [2]. 
This reduction of dimensionality reveals a latent structure of the documents 
that would not have been noticed otherwise. In this experiment, the dimension 
was reduced to 280. 

Because the data set is large and consists mostly of zeros, a sparse matrix 
data structure was used (found in [2]). Results were found by comparing each 
document vector to the query vector using the normalised dot product. The 
query vector was created by taking the average of the word vectors (produced 
by the SVD) of the words that appeared in the search term. This method will be 
referred to as LSI throughout this document. 



6.3 Experiment One: TREC Data 

The TREC data mi contains millions of documents from many different sources. 
This is useful to evaluate Internet search engines and searching tools for large 
databases. The method proposed in this paper is a refining process. It takes a 
subset of the whole data and extracts the truly relevant information from that. 
To emulate the process of the search engines, a number of random documents 
were chosen from the original data set, while making sure that documents 
classified as relevant and as irrelevant were both included (shown in Fig. 4). 
The irrelevant documents are those that have been classified as relevant by 
other information retrieval methods but found to be wrong. By including these 
documents, the information retrieval methods applied here will be truly 
challenged. The results were evaluated using ‘trec_eval’, which was supplied by 
the TREC organisers. 

The results from using the TREC data are shown in Table 2. The data focused 
on was from the Associated Press document set, using queries 101 through to 200. 
Two experiments were run: the first processed 500 pseudo-random documents 
from the AP document set, the second processed 1000 documents. 

Fig. 4. To emulate the process of the search engines, a sample of relevant and irrelevant 
documents was taken from the TREC data set to feed into the document filtering 
process 



It should be noted that it would have been very unlikely that any document 
contained every word in a selected query, due to the nature of the query. Rather 
than being a phrase or a few key words (which is what would be normally 
supplied to a search engine), the queries were structured into the form of a 
description of desired documents. 

Table 2. Results given by trec_eval from a semi-random sample of documents using 
various filtering methods. FDS has the greatest precision for 500 documents and gives 
similar results to COS for 1000 documents 

          500 documents                      1000 documents 
Method    Average Precision   R-Precision    Average Precision   R-Precision 
FDS       0.4439              0.3972         0.4560              0.3933 
COS       0.4371              0.3830         0.4598              0.3930 
LSI       0.3756              0.3508         0.3661              0.3508 


The results show that FDS gives the best results on the 500-document sample 
and results comparable to COS on the 1000-document sample, where COS has a 
slightly higher average precision and FDS a slightly higher R-precision. LSI 
trails both. For all of the methods, the scores are low. A reason for this is 
the way the TREC queries are set up. A typical TREC query is of the form: 

A {type of document} will identify {case 1} and {case 2} or ... but not ... 

Therefore it requires parsing to obtain the keywords and anti-keywords. Some of 
the queries that gave lower precision did not contain examples. They only 
contained statements like “...contains information about a country...”. The 
methods used contained no data on what country names are and so could not find 
the relevant documents. 

6.4 Experiment Two: Internet Documents 

The following results were obtained from a data set of documents returned from 
various search engines on the Internet after searching for three items with various 
degrees of difficulty. The query “Aluminium Cricket Bat” had only a few relevant 
documents, “Bullet the blue sky mp3” had some relevant documents, and “Prank 
calls from Bart Simpson to Moe” contained many relevant documents. The data 
was cleaned by extracting the text from the html structure, changing all the 
letters to lowercase, removing any stop-words (listed in a previously compiled 
list) and converting each word to its stem by using Porter’s Stemming Algorithm. 
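
The cleaning steps can be sketched as a short pipeline. This is an illustration only: the stop-word list is a hypothetical stand-in for the previously compiled list, and the stemmer is left as a pluggable function (identity by default) because Porter's algorithm is not part of the Python standard library.

```python
import re
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects the text nodes of an HTML document, discarding the tags."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


# Hypothetical stop-word list; the list used in the experiments is not given.
STOP_WORDS = {"a", "and", "from", "of", "the", "to"}


def clean_document(html, stem=lambda word: word):
    """Extract text, lowercase it, drop stop-words, then stem each word.

    `stem` is assumed to be a Porter stemmer supplied by the caller.
    """
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    text = " ".join(parser.chunks).lower()
    words = re.findall(r"[a-z0-9]+", text)
    return [stem(word) for word in words if word not in STOP_WORDS]


terms = clean_document("<html><body><h1>Prank Calls</h1><p>from Bart to Moe</p></body></html>")
```

Note that this simple extractor also keeps the contents of script and style elements; a fuller version would skip them.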

The html documents were individually examined and assigned a label of 
relevant or not relevant. These labels were then compared with the scores given 
by the listed methods (mentioned throughout the document). The results are 
presented in Table 3, showing the method, the number of documents used, the 
number of relevant documents, and the precision of the method for that search. 



Table 3. This table shows how well each method worked on different sets of 
documents retrieved from the Internet 

Search for “Aluminium Cricket Bat” 
          No. of Documents 
Method    Total    Relevant    Average Precision    R-Precision 
FDS       120      2           0.5667               0.5000 
COS       120      2           0.2857               0.5000 
LSI       120      2           0.2845               0.5000 

Search for “Bullet the blue sky mp3” 
          No. of Documents 
Method    Total    Relevant    Average Precision    R-Precision 
FDS       120      13          0.9822               0.9231 
COS       120      13          0.7467               0.7692 
LSI       120      13          0.6567               0.6923 

Search for “Prank calls from Bart Simpson to Moe” 
          No. of Documents 
Method    Total    Relevant    Average Precision    R-Precision 
FDS       120      27          0.9279               0.9259 
COS       120      27          0.7425               0.7778 
LSI       120      27          0.7349               0.7778 


The results show that the FDS technique achieves a substantially higher 
average precision than the LSI and COS document indexing schemes on all three 
queries, and an R-precision at least as high. This is due to the fact that FDS 
is able to extract more information from the documents. LSI and COS treat the 
documents as though they are a ‘bag of words’, while FDS observes any structures 
of the searched terms in the document. COS can be considered a special case of 
the FDS method, therefore FDS is expected to obtain better results. 

7 Conclusion 

While Internet search engines produce good results, they don’t always give us 
exactly what we want. The proposed FDS method overcomes this problem by 
filtering the results given by the search engines. The LSI and COS methods need 
a wide range of document types to really focus on the important documents. In 
this case most of the documents will be of the same class and therefore these 
methods will not work as well as the new FDS method. 

LSI and COS methods could be considered sub-methods of FDS. LSI and 
COS methods consider only the DC components while FDS makes use of the 
full spectrum. This means that LSI and COS do not require as much storage 
space for the calculations, since they only take a fraction of the data needed 
by FDS. But at the current rate at which storage media is growing in capacity, 
this is hardly an issue. The size of the stored information is proportional to the 
number of frequency components, which can be adjusted by changing the words 
per bin. 

FDS can be implemented on the client side (as discussed throughout this 
document) or it could be implemented on the server side. It can easily be included 
in systems like Internet search engines since it is scalable (when extra documents 
are introduced into the database, the other documents are not affected), the 
frequency data can easily be put into an indexing table (current indexing tables 
only include the DC component, therefore FDS would be a simple extension of 
this), and the most important reason is that it gives excellent results. Including 
the FDS method in any search engine would boost the quality (in terms of 
results) of the search engine, and return a more relevant document set. 

Acknowledgements. We would like to thank the ARC Special Research Centre 
research. 

Abstract. We present a method for distinguishing two subtly different 
mental states, on the basis of the underlying brain activation measured 
with fMRI. The method uses a classifier to learn to distinguish between 
brain activation in a set of selected voxels (volume elements) during 
the processing of two types of sentences, namely ambiguous versus un- 
ambiguous sentences. The classifier is then used to distinguish the two 
states in untrained instances. The method can be generalized to accom- 
plish knowledge discovery in cases where the contrasting brain activation 
profiles are not known a priori. 



1 An fMRI Study of Sentence Processing 

1.1 Introduction 

This paper builds on an fMRI (functional Magnetic Resonance Imaging 0) study 
of cortical activity during the reading of syntactically ambiguous sentences |2| . 
The latter are sentences in which a word can have one of two lexical and syntactic 
roles, and there is no disambiguating information in the context that precedes 
the ambiguity. However, the ambiguity is resolved by information occurring later 
in the sentence. For instance, in 

The horse raced past the barn escaped from his trainer. 

the meaning of the sentence is clear after the word “escaped” is reached. The 
ambiguity occurs at “raced”, which could be interpreted as either a past tense 
(preferred) or a past participle (unpreferred). An unambiguous sentence could 
be, for example 

The experienced soldiers spoke about the dangers before the midnight raid. 

where “spoke” is unambiguously the past tense form. 

The study analyzed the activation in different parts of the brain every 1500 
msec during the reading of ambiguous sentences and unambiguous control sen- 
tences. The analysis performed examined both the amount and location of such 
activation and the contrast between the two types of sentences, expressed in 
those terms. 

One additional dimension of analysis could be the characterization of what is 
different between activation in the two experimental conditions. In addition to 
the amount of activation triggered, one might have to consider differences in the 
shape of the activation response, in localization (i.e. some points are only active 
in one of the conditions), and in timing. One could even consider the question 
of ascertaining whether there is more than one kind of cognitive process taking 
place. 

But while it is relatively simple to test for such things as differing amounts 
of activation, that is not the case for the other questions. We propose a method 
for identifying specific locations in the brain where activation patterns are dis- 
tinguishable across experimental conditions and, through that set of locations, 
allowing the discovery of answers to the questions above. 

1.2 Syntactical Ambiguity Experiment 

Let us take a closer look at what it means for a sentence to be ambiguous. The 
development of the sentence can be more or less surprising, and sentences taking 
the less likely meaning are called ambiguous unpreferred sentences. Ambiguous 
sentences which develop with the most predictable meaning are called ambiguous 
preferred sentences, and sentences without any ambiguity are unambiguous sen- 
tences. We shall concentrate on distinguishing ambiguous unpreferred and unam- 
biguous sentences, and hence shall use the designations ambiguous/unambiguous 
from this point onwards. 

The study above concentrated on two cortical areas known to be involved 
in sentence processing, the Left Inferior Frontal Gyrus (LIFG), also known as 
Broca’s area, and the Left Superior Temporal Gyrus (LSTG)/ Left Posterior 
Middle and Superior Temporal Gyrus (LMTG), known as Wernicke’s area. These 
will henceforth be referred to as Regions of Interest (ROIs) . During sentence pro- 
cessing, these two areas showed a significant increase in activation when com- 
pared to their behavior during a control condition. It was also observed that 
brain activation went to a higher level, and remained at such a level for a longer 
period of time, during the processing of ambiguous sentences than it did during 
the processing of unambiguous sentences. As the processing of ambiguities leads 
to an increase in the demand for cognitive resources P|, such an increase results 
in additional cortical activity, which we are interested in characterizing. 

1.3 Data Processing and Analysis 

Each subject was presented a sequence of 20 trials (sentences), 10 ambiguous 
and 10 unambiguous, presented in a random order. Each trial consisted of the 
presentation of a sentence for 10 seconds, followed by a yes/no comprehension 
question. Cortical activity during the processing of each sentence was recorded 
every 1500 msec, providing a time series that constituted the basis of our data. 
The part of the brain under scrutiny is divided into a number of volume elements 
called voxels, measuring 3.125 x 3.125 x 5 mm. During the experiment, the BOLD 
(Blood Oxygen Level Dependent) signal at each voxel was measured every 1500 
msec. This response is an indirect indicator of neural activity 0, and thus we 
will use the terms activity and amount of activation to refer to the level of the 
said response. An image is a set of such activation values, with one value per 
voxel. As images are acquired, the result is a succession of values for each voxel, 
containing its activation values at each instant of the experiment. The images are 
acquired not volumetrically but sequentially in slice planes 5 mm thick, with the 
acquisition of all 7 slices distributed over 1500 msec. The slight differences in 
acquisition time for the different slices are later corrected for by an interpolation 
technique. 

The recordings available at each voxel will then consist of one time series 
of the activation for each sentence, as well as some extra series corresponding to 
“baseline” activation during a control condition, during which the subject just 
fixates an asterisk instead of going through sentence-reading. Obviously, during 
the latter there should be no sentence-processing related activity. 

The next step in the analysis was the identification of active voxels. These are 
voxels that display a significant activity in any of the experimental conditions. 
The activity during the experimental condition is gauged by comparing it with 
the one taking place during the control condition. This was done using a voxel- 
wise t-test comparing the activation level in the baseline condition and during 
all other conditions. A very high t threshold is used (equivalent to a Bonferroni 
correction for multiple comparisons) to identify voxels whose activation level 
during sentence processing differs significantly from their level during the control 
condition. 

The time courses of the active voxels in a subject were then averaged across 
the 10 sentences of the same type and normalized as “percentage of activation 
above baseline”. Afterwards, these were averaged across subjects. The end result 
was an average timecourse for every sentence trial, from which was obtained the 
average timecourse for each type of sentence. From analyses of variance of the 
data it was possible to conclude that brain activation went to a higher level, and 
remained at such a level for a longer period of time, during the processing of 
ambiguous sentences than it did during the processing of unambiguous sentences. 

The expected model for this sentence processing task features each ROI re- 
cruiting voxels from a certain pool as resources for sentence processing in general. 
If a particular sentence demands more resources than are available in the pool 
(through its being ambiguous, for instance), more activity will be demanded 
from pool voxels and, eventually, other voxels might be recruited. We would 
like to extend this study by analyzing the degree and manner of involvement of 
these specially recruited voxels in the task being performed, and this effort will 
be described in Sect. 2. 

2 Identifying Voxels with Varying Behaviour across 
Conditions 

2.1 Our Approach and Related Work 

The problem as we see it consists in ascertaining whether the behaviour of the 
BOLD response at some voxels is distinguishable between the two experimental 
conditions. If so, we would have candidate locations that might be supporting the 
additional activation required for processing ambiguous sentences, which could 
then be examined. 

Currently, this question is addressed by identifying the most active voxels 
in two different experimental conditions, averaging their time series under each 
condition and then comparing the two averages. Identification of active voxels is 
done through a voxelwise t-test on one of the following: the difference between 
the mean activity during experimental and control conditions or the correlation 
of the voxel time series with a paradigmatic time series which corresponds to 
the expected response for voxels involved in the task. 

A more agnostic approach is the clustering of the time series of all the voxels, 
guided either by known constraints on present cognitive processes or just using 
hierarchical clustering and using one of several possible metrics (see 0)- The 
centroids of the clusters thus found are then examined in the same way as the 
average time courses found through t-tests. 

Yet another approach is to use a Bayesian model of the fMRI signal at each 
voxel (see 0). This can be used for questions beyond that of whether a given 
voxel is active or not, such as the influence of experimental condition on the 
parameters of the model, while being subject to assumptions regarding plausible 
signal shapes, noise and other factors. 

A closer look at the t-test and clustering approaches will reveal that they 
give no guarantees of identifying voxels where the time series are different across 
experimental conditions. To see why, consider that a high t-value for the contrast 
between experimental conditions and control only pertains to the mean activity 
and says nothing about the shape of the time series, which may be different for 
voxels that behave differently across the two experimental conditions. In addition, 
the mean activity may be lower for such voxels than for the majority of voxels 
that accounts for the bulk of the activation. 

If we were to use the t-test approach to test the mean during one experimental 
condition against the mean during another, we would probably find that the 
means were too similar for strong results in most voxels where there is activity 
in both conditions. A clustering approach applied separately to each condition 
would still identify groups corresponding to the bulk of activation, which 
coincides in the two conditions. 

The Bayesian modelling approach allows for greater flexibility in that 
questions besides that of whether a voxel is active or not can be posed. In our 
case, the question would be whether the shape of the time series differs on a 
point by point basis across the two conditions. The caveat in this case is that 
the model presupposes a certain BOLD response shape. Model fitting is 
accomplished by estimating the values of the model parameters from data. While 
this is fine for the bulk of the active voxels, which mainly share the same 
response shape, it might not work for voxels where the shapes of the response 
are unusual in one or both conditions. Moreover, prior information about 
response shape is, in many cases, derived from observations in areas such as 
motor cortex, and the response there need not be exactly the same as in areas 
performing cognitive functions such as language processing. 

Therefore, we would like to find differences in time series for a given voxel 
in a way that is independent both of assumptions regarding response shape 
and of considerations about level of activity. We propose to use a classifier to 
learn the difference on a voxel by voxel basis, treating the two types of 
sentences under consideration, ambiguous and unambiguous, as the two classes of 
examples to be learned. The features on which the learning will be 
based are the activation values recorded for each voxel at each time point during 
the processing of a sentence. The examples are the time series for each sentence 
for both conditions. 

There are other possibilities for representation of the time courses. In the 
extreme, one might just consider the mean activity. Another possibility is to 
represent the time series as a number of adjacent temporal sections, in terms of 
which the activity is described, which is what is done for the Bayesian modelling 
approach cited above. Yet another would be to obtain derived features such as 
spectra obtained via Fourier or wavelet transforms and cast the learning problem 
in terms of them. 

Our hope is that the degree of success of a classifier on a voxel can be taken as 
an estimate of how much the activity in the corresponding voxel differs between 
ambiguous and unambiguous sentences, the two experimental conditions. This, 
in turn, should be an indication of the degree of involvement of the voxel in 
specifically processing ambiguity. 

2.2 Experimental Procedure 

The data available for each voxel consists of 10 sentences per condition, where 
each sentence is a time series of 16 activation values (24 sec of data) captured 
during and immediately after the processing of a sentence. We used only values 4 
to 13 in each token, eliminating the pre-rise and post-activity decay of the BOLD 
response. Each of the values in a token has been normalized as a percentage, 
referring to how much it was above the average base level of activation during 
the control condition of the experiment. 

In classifier terms, this maps to 10 examples of each class, where each example 
is a series of 16 floating point values, available for every voxel in both ROIs for 
6 subjects. 

In addition, given the already small number of examples we would like to 
have as few features per example as possible, and thus 10 features were retained 
in the part of the series where differences are more likely, as mentioned above. 
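
The preparation of one example can be sketched as follows. The exact normalisation formula is an assumption here (percentage change relative to the mean baseline level), values are taken to be numbered from 1, and the function names are hypothetical.

```python
def percent_above_baseline(series, baseline_mean):
    """Express each activation value as a percentage above the baseline mean.

    Assumed formula: 100 * (value - baseline) / baseline.
    """
    if baseline_mean == 0:
        raise ValueError("baseline mean must be non-zero")
    return [100.0 * (value - baseline_mean) / baseline_mean for value in series]


def select_features(series, first=4, last=13):
    """Keep values first..last (1-based, inclusive) of a sentence time series."""
    return series[first - 1:last]


# Hypothetical raw series of 16 samples (24 sec at one sample per 1500 msec).
raw_series = [100.0 + i for i in range(16)]
features = select_features(percent_above_baseline(raw_series, baseline_mean=100.0))
```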

As what is needed is an estimate of how accurately the difference between 
conditions at a given voxel can be learned using a given classifier, we resort to 
leave-1-out cross validation over the 20 examples available, while taking care to 
balance the number of training examples across both classes. Initially, we create 
10 random pairs with one example of each class. For each pair, we train on the 
remaining 18 examples, 9 of each class, and then test each of the examples of 
the pair. The estimate for attainable classifier accuracy for this voxel will be 

(# of successes) / 20. 
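
The balanced hold-out procedure can be sketched for a single voxel as follows. A nearest-centroid classifier is used purely as a simple stand-in; it is not one of the classifiers used in the study, and all names are hypothetical.

```python
import random


def fit_centroids(examples, labels):
    """Stand-in classifier: one mean vector per class."""
    centroids = {}
    for label in set(labels):
        rows = [x for x, y in zip(examples, labels) if y == label]
        centroids[label] = [sum(column) / len(rows) for column in zip(*rows)]
    return centroids


def predict(centroids, example):
    """Return the label of the closest centroid (squared Euclidean distance)."""
    def squared_distance(label):
        return sum((a - b) ** 2 for a, b in zip(example, centroids[label]))
    return min(centroids, key=squared_distance)


def paired_holdout_accuracy(class_a, class_b, seed=0):
    """Hold out one random pair (one example per class) at a time.

    Trains on the remaining examples, tests both held-out examples, and
    returns (# of successes) / (total number of examples).
    """
    if len(class_a) != len(class_b) or len(class_a) < 2:
        raise ValueError("need equally sized classes with at least 2 examples")
    rng = random.Random(seed)
    a, b = list(class_a), list(class_b)
    rng.shuffle(a)  # shuffling both lists makes the pairing random
    rng.shuffle(b)
    successes = 0
    for i in range(len(a)):
        train_x = a[:i] + a[i + 1:] + b[:i] + b[i + 1:]
        train_y = [0] * (len(a) - 1) + [1] * (len(b) - 1)
        model = fit_centroids(train_x, train_y)
        successes += int(predict(model, a[i]) == 0)
        successes += int(predict(model, b[i]) == 1)
    return successes / (len(a) + len(b))
```

With 10 examples per class this trains 10 models on 18 examples each and scores 20 test predictions, matching the counts given in the text.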

The classifiers used in each voxel were a neural network (NN) with one sig- 
moidal hidden unit (see jS]), effectively doing logistic regression, and a linear 
kernel support vector machine (SVM) (see 0). The choice of classifier was lim- 
ited by the small number of examples available. We also tried other alternatives, 
such as SVMs with more complex kernels, NN with more hidden units and a sim- 
ple k-nearest neighbour classifier. These were discarded because the performance 
was worse. 

There are thus twelve sets of classification problems, one for each 
subject/ROI combination, each containing hundreds of classification problems, one 
per voxel. The output of the process is, for each subject and ROI, a list of voxels 
in that ROI ranked by decreasing classifier accuracy. Within groups of voxels with 
the same accuracy level, an additional ranking is performed based on “quality” 
of the classification (lowest mean squared error for NN or highest absolute value 
of the decision function for SVM). For each classifier, the process is run a 
number of times with different seeds, the resulting rankings are averaged, and 
it is on these averaged lists that the analysis detailed below is performed. 
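
The ranking step can be sketched as below for the NN case. How the rankings from different seeds were averaged is not specified, so averaging each voxel's list position is an assumption made here; the function names and tuple layout are hypothetical.

```python
def rank_voxels(results):
    """Rank voxels by decreasing accuracy, breaking ties by lowest error.

    results: iterable of (voxel_id, accuracy, mean_squared_error) tuples.
    """
    return sorted(results, key=lambda item: (-item[1], item[2]))


def average_rankings(rankings):
    """Order voxels by their mean position over several runs.

    rankings: list of lists of voxel ids, one list per seed (assumed scheme).
    """
    positions = {}
    for ranking in rankings:
        for position, voxel in enumerate(ranking):
            positions.setdefault(voxel, []).append(position)
    return sorted(positions, key=lambda v: sum(positions[v]) / len(positions[v]))


# Hypothetical results for three voxels in one run.
one_run = rank_voxels([("v1", 0.80, 0.20), ("v2", 0.90, 0.30), ("v3", 0.80, 0.10)])
ranked_ids = [voxel for voxel, _, _ in one_run]
```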



2.3 Experimental Results 

We found that it is possible to discriminate between activity in each of two 
experimental conditions for a small subset of the examined voxels. After applying 
the procedure described, we obtained consistent results across subjects and ROIs, 
in that a subset of 1% of the classifiers almost always has mean accuracy of 
80% or more, for the best performing method (NN). The value of 1% typically 
corresponds to 3 to 7 voxels per subject/ROI (i.e. about 150 to 350 cubic mm) 
out of an anatomically huge volume of cortex. Considerations of how plausible 
the voxels found are from the psychological point of view are dealt with in the 
next section. 
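
The volume figures quoted above follow from the voxel dimensions given in Sect. 1.3, as the short check below shows; the variable names are illustrative only.

```python
# Voxel dimensions in mm, as stated in Sect. 1.3.
VOXEL_DIMS_MM = (3.125, 3.125, 5.0)

voxel_volume_mm3 = VOXEL_DIMS_MM[0] * VOXEL_DIMS_MM[1] * VOXEL_DIMS_MM[2]

# Total volume covered by 3 and by 7 voxels.
volumes_mm3 = {count: count * voxel_volume_mm3 for count in (3, 7)}
```

One voxel occupies roughly 48.8 cubic mm, so 3 to 7 voxels cover roughly 146 to 342 cubic mm, in line with the rounded range of 150 to 350.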

In each subject/ROI pair the distribution of accuracies is a unimodal curve 
centered around 50%, with a heavier tail for the higher accuracies and a smaller 
one for low accuracy voxels. Details about the distribution of the accuracy scores 
are discussed in the section that compares the results to a null model. 

Because the goal was to find the relatively small group of voxels which acti- 
vated differently during the processing of the two types of sentences, we will pro- 
ceed considering only the top 1% most accurate voxels in each ROI and subject 
combination. Note that this number of voxels (3-7) is on the order of magnitude 
of the number of voxels believed to show real activation, typically demonstrated 
by their time-locking to the stimulus events and their signal amplitude in the 
experimental conditions, as detected through a t-test. 

Table 1 details the mean accuracy attained in the top 1% group of voxels 
classified through each of the two methods. In addition, we were interested in 
finding out whether the voxels for which accuracy was greater using each method 
were the same, and thus the table contains the percentage of overlap between 
the groups. The comparison is made for the 6 subjects and 2 ROIs. 



Table 1. Mean accuracy of the top 1% group selected using each classification method, 
as well as percentage of those belonging to the two groups 

Subject#/ROI    Accuracy NN    Accuracy SVM    Overlap % 
1-LIFG          0.82           0.80            0.50 
1-LSTG          0.82           0.79            0.29 
2-LIFG          0.78           0.72            0.00 
2-LSTG          0.82           0.70            0.17 
3-LIFG          0.82           0.81            0.25 
3-LSTG          0.77           0.75            0.33 
4-LIFG          0.85           0.78            0.67 
4-LSTG          0.81           0.79            0.00 
5-LIFG          0.86           0.76            0.00 
5-LSTG          0.81           0.76            0.00 
6-LIFG          0.81           0.83            0.33 
6-LSTG          0.87           0.75            0.40 
mean            0.82           0.77            0.25 



The NN group had a mean accuracy ranging from 77% to 87% over subjects 
and ROIs, with an overall average of 82%, as displayed in the second column of 
Table 1. This means that a NN was capable of correctly distinguishing whether 
the subject had been reading an ambiguous or an unambiguous sentence on the 
basis of the fMRI activation data 82% of the time, on sentences on which it had 
not been trained, for each of these voxels. The SVM group had a mean accuracy 
ranging from 70% to 83%, averaging 77%. As this indicated some systematic 
difference between the two groups, we ran an ANOVA test on the difference 
in mean accuracy on the top 1% voxels between the two classifiers for all sub- 
jects/ROIs. In this test NN performed reliably better than SVM, across subjects 
and ROIs, at the 5% significance level, and therefore we will focus the rest of 
the paper on NN selected voxels. 

Interestingly, there was not a great deal of overlap between the groups of 
most predictive voxels selected using each of the two methods, as evident in the 
mean overlap of 25%. This was unexpected, given that both methods resort to 
linear decision boundaries. 

2.4 Comparison to a Null Model 

As the classifiers for a few of the voxels considered do attain high levels of 
accuracy, we would like to demonstrate that this does not happen by chance. 
In addition, we would like to know for how many voxels the effect can be 
considered reliable. 

Given the number of voxels involved and the procedure used, a classifier that 
guessed at random could conceivably attain relatively high accuracies in some 
small number of voxels. On the other hand, if we observed this happening for a 
large enough number of voxels, it would be less and less likely. 

One way of testing this is to postulate a null model in which every voxel 
has the same inherent classifiability, and then examine what the probability 
of obtaining our accuracy results under that model is. If that probability is 
small, the null model can be rejected, indicating that the inherent 
predictability at different voxels varies and can be high for a small group of 
them. 

For a given voxel the accuracy of a classifier is an estimate of an underlying 
“true” accuracy attainable with that classifier. Given that the classifier is 
tested over 20 examples, the outcome can be seen as a sample from a binomial 
variable with 20 trials and probability of success equal to the underlying 
accuracy. In practice, the 20 trials are not independent, as they are a sequence 
of leave-1-out trials in which every pair of trials shares all but one training 
example. As performing the analysis in the latter case is far more complicated 
and would introduce details specific to the classification method used, we will 
proceed assuming independence. An empirical argument as to why the results thus 
obtained can still be used is given later. 

Let us assume as a null hypothesis that the accuracy in each voxel is the 
same, and is some value close to 50%. The latter can be the empirical average 
of accuracies attained in most ROIs, which is indeed around that value. 

We will examine the probability of the higher scores in a ROI under this 
model, assuming the ROI has n voxels: 

$$\Pr(1^{\text{st}} \max > a) = 1 - \Pr(\max \le a) = 1 - (\mathrm{cdf}(a))^{n}$$

and, in general, 

$$\Pr(k^{\text{th}} \max > a) = 1 - (\mathrm{cdf}(a))^{n-k+1}$$

where $\mathrm{cdf}(a) = \Pr(X \le a)$ when $X \sim \mathrm{Binomial}(20, p)$, with $p$ being the mean 
observed accuracy across the ROI. 
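
As a rough illustration of the null-model computation, the following Python sketch evaluates the tail probabilities for a ROI with n voxels. It is a minimal sketch under stated assumptions: the function names are hypothetical, the accuracy threshold is expressed as a count of correct trials out of 20, and the exponent for the k-th highest score is read as n - k + 1.

```python
from math import comb


def binom_cdf(k: int, trials: int, p: float) -> float:
    """Pr(X <= k) for X ~ Binomial(trials, p)."""
    return sum(comb(trials, j) * p**j * (1 - p) ** (trials - j)
               for j in range(k + 1))


def prob_kth_max_exceeds(k_correct: int, n_voxels: int, rank: int = 1,
                         trials: int = 20, p: float = 0.5) -> float:
    """1 - cdf(a)^(n - rank + 1), with a given as a number of correct trials.

    rank=1 is the highest score in the ROI; the exponent for rank > 1 is an
    assumption about the formula in the text.
    """
    return 1.0 - binom_cdf(k_correct, trials, p) ** (n_voxels - rank + 1)


if __name__ == "__main__":
    # Hypothetical ROI of 400 voxels with null accuracy p = 0.5.
    for correct in (15, 16, 17, 18):
        print(correct, prob_kth_max_exceeds(correct, n_voxels=400))
```

A score would then be flagged as significant when the printed probability is 0.05 or less, which is the criterion described in the text.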

Under this model, we can calculate the probability that a given high score 
would be observed, and declare the score significant (and unexpected 
under the model) if that probability falls below a certain threshold. For our 
analysis we used this criterion, and considered an accuracy level significant 
if it had a probability of 5% or less of occurring under the model. 

Table 2 contains the results of this experiment discriminated by subject/ROI 
combination, with the number of voxels considered significant out of the total, 
the least accuracy attained on one of those voxels and the percentage of voxels 
out of the total that constitute the group. 

Table 2. Breakdown of the number of voxels with significant differences across conditions 

Subject-ROI   # significant   out of   accuracy of lowest (%)   top % 
1-LIFG        2               395      90                       0.5 
1-LSTG        0               617      0                        0.0 
2-LIFG        0               257      0                        0.0 
2-LSTG        3               543      85                       0.6 
3-LIFG        3               329      85                       0.9 
3-LSTG        1               536      95                       0.2 
4-LIFG        3               257      85                       1.2 
4-LSTG        0               516      0                        0.0 
5-LIFG        3               244      85                       1.2 
5-LSTG        2               204      85                       1.0 
6-LIFG        1               269      90                       0.4 
6-LSTG        5               415      85                       1.2 

As stated before, these results are related to a null model where we assume, 
for simplicity, that the results of each of the 20 leave-1-out trials for a voxel 
are independent. Treating the case where the trials are not independent is, in 
our view, too complicated, so instead we decided to run an empirical test, as 
follows. 

The same setup as for the NN experiments was used, but the data was 
randomized. Five of the ten examples in each class were selected at random 
and switched to the other class. In this fashion each classifier was guaranteed to 
have five correct examples in each class and five incorrect ones, and its expected 
accuracy should not be more than 50%. Note that this is a case where the results 
in each of the 20 leave-1-out trials are certainly not independent, as each pair 
of trials shares most of the training data. The same number of repetitions were 
made and the results were ranked in the same way as originally. 
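
The label randomization just described can be sketched in Python as follows. This is only an illustration: the function name and the 0/1 encoding of the two classes are assumptions, and the classifier training itself is left out.

```python
import random


def swap_half_labels(labels, n_swap=5, seed=None):
    """Return a copy of binary labels with n_swap examples per class flipped."""
    rng = random.Random(seed)
    flipped = list(labels)
    for cls in (0, 1):
        # Indices that belong to this class in the ORIGINAL labelling.
        idx = [i for i, y in enumerate(labels) if y == cls]
        for i in rng.sample(idx, n_swap):
            flipped[i] = 1 - cls
    return flipped


if __name__ == "__main__":
    original = [0] * 10 + [1] * 10  # ten examples per class
    randomized = swap_half_labels(original, seed=0)
    print(randomized)
```

After the swap each class still has ten examples, five of which carry their true label, so a classifier trained on the randomized labels has no systematic signal to exploit.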

Looking at the number of voxels given by the null model procedure in the 
randomized results we notice that their accuracies are far below what could be 
attained by the null model outputting random classifications. 

The point is that the few classification results on the table are deemed im- 
probable under the null model with the independence assumption, and they are 
far more improbable under a true model where the expected accuracy is 50% and 
where maximum accuracy practically never rises above 70%. As a consequence, 
it is reasonable to think that the test based on the independence assumption is 
conservative. 

Moreover, most of the subject/ROI combinations contain a few voxels that 
are significant by this criterion. Given that our wish is to narrow down the 
number of voxels to be manually examined, we feel that the possibility of allowing 
some false positives in the group is acceptable. Nevertheless, we will proceed by 
considering only the voxels deemed significant through the procedure above. 

Our conclusion is that for most voxels there is no inherent discernibility, but 
that it very probably exists for at least a small subset of them, and that this 
warrants the exploration described in the previous sections. 

2.5 Voxel Characteristics 

The group of voxels in consideration in each subject/ROI was picked because 
the activity in each voxel was discernible across the two experimental conditions. 
This may have been so because of heightened activity during the processing of 
ambiguous sentences. However, it may also mean that what stands out is the level 
of activity during the processing of unambiguous sentences as being unusual. 

There are also no guarantees that the identified voxels have a high degree of 
activation, at least on average. In fact, for each combination of subject and ROI, 
the group identified almost always has no voxels in common with the subset of 
the top 1% most active voxels for the same combination, as selected through a 
t-test. This t-test compares the mean activity during experimental conditions 
and during a control condition which acts as a baseline. 

One possible application of being able to find this group of voxels is to identify 
common characteristics in their time courses, such as the onset and duration of 
higher activity in a given experimental condition. 

There was no clear trend in the logistic regression coefficients found, which 
was our initial expectation and would allow us to target a specific temporal 
region as the source of the difference. In addition, some of the voxels found 
displayed higher activity for ambiguous sentences, others for unambiguous ones. 

The voxels with higher activity in ambiguous sentences did correspond to our 
expectation of the type of voxel recruited when resource demands are extreme. 
This is reflected in their not showing a consistent higher activation throughout 
the task, but rather high activity intervention in short bursts, which presumably 
would be where resources are more demanded for the ambiguous sentences. 

A tentative explanation for this would be that different sentences are not 
matched for length, and thus the ambiguity, and the corresponding demand for 
extra resources, can occur at different points in time. This contrasts with the 
voxels in the most active group, which display consistently high activity 
throughout most of the task, as they are involved in the main processing, and 
higher activity during processing of ambiguous sentences. 

A serious objection may be that many of the voxels selected do not show a 
clear spatial distribution, contrary to what happens for the active voxels, which 
tend to cluster tightly and where activation spreads radially as demand increases. 
While a few are adjacent to such active voxel groups, most are set at locations 
further away (still inside the ROI) . If we expect a model where activity percolates 
to neighbours from a centre as demand rises, then this is hard to explain. 

There were no significant contrasts in accuracy across subjects, ROIs and 
experimental conditions (across sentences being tested). 

2.6 Conclusion and Further Work 

We have presented a novel method for identifying voxels with contrasting 
behaviors across experimental conditions in an fMRI study. Through its use it was 
possible to find a subset of such voxels in a real dataset. 

Unfortunately, further analysis of this subset of voxels across subjects failed 
to reveal any striking temporal patterns of activity or contrasts in accuracy 
related to varying experimental variables (condition, subject, ROI). Many of 
the selected predicting voxels seem to have been picked for reasons not readily 
observable with the naked eye or easily relatable to the actual processing, which 
was our original hope. We are uncertain about what attributes of the time series 
of the predicting voxels are used by the classifier. 

We do think, however, that the use of this method may be more successful 
in somewhat different studies. This would require alterations in experimental 
design so as to have sentences with similar lengths and positioning of ambiguity. 
Other possibilities lie in the use of alternative representations of the time series, 
be it through the use of composite features built from the initial series or through 
representations incorporating some prior assumptions regarding BOLD response 
shape. While limiting what can be learned, the latter might not be so restrictive 
as a full model of the response shape with parameters fitted to the data. 

Another possibility would be to use the method to compare not two ex- 
perimental conditions but all the conditions against the control condition. This 
could be used in place of t-tests for identifying active voxels as those where there 
is a greater contrast between activity during the experiment and activity dur- 
ing control, where the activity should be minimal and, above all, unstructured. 
The t-tests often performed for this purpose take into account solely the mean 
activity during experiment, and therefore our method would incorporate more 
information and possibly provide a better result. A completely different avenue 
rooted in the same idea would be the use of statistical tests of the difference 
between the distributions of time courses from the two classes for a given voxel 
(see, for instance, 0). 

Abstract. In many applications, it becomes crucial to help users access 
a huge amount of data by clustering them into a small number of 
classes described at an appropriate level of abstraction. In this paper, 
we present an approach based on the use of two languages of description 
of classes for the automatic clustering of multi-valued data. The first 
language of classes has a high power of abstraction and guides the 
construction of a lattice of classes covering the whole set of the data. The 
second language, more expressive and more precise, is the basis for the 
refinement of a part of the lattice that the user wants to focus on. 



1 Introduction 

The main goal of our approach is to help users access a huge amount of 
multi-valued data by clustering them into a small number of classes organized in 
a hierarchy and described at an appropriate level of abstraction. The data (i.e. 
instances) we want to treat are described by a varying set of attributes that 
can be multi-valued, and each instance is labelled by the name of a basic type. 
The approach is based on the use of two languages of description of classes. 
The first language of classes has a high power of abstraction and guides the 
construction of a coarse lattice covering the whole set of the data. The second 
language, more expressive and more precise, is the basis for the refinement of a 
part of the lattice that the user wants to focus on. This second language actually 
allows us to distinguish new clusters. Besides, we exploit the fact that the data 
are labelled: we gather instances sharing the same label in basic classes and we 
construct clusters of basic classes rather than clusters of instances. 

In Sect. 2, we present languages of instances and classes together with some basic 
operations used in the lattice construction steps. Section 3 describes the first 
clustering algorithm leading to the construction of a crude lattice. A part of this 
lattice may be refined using the second clustering algorithm described in Sect. 4. 
Our approach has been implemented and tested on real data in the setting 
of the GAEL project, which aims at building flexible electronic catalogs. Our 
experiments have been conducted on real data coming from the C/Net electronic 
catalog of computer products (http://www.cnet.com). Experimental results are 
presented in Sect. 5. 

2 Languages of Instances and Classes 

In this section, we define the language of instances, $\mathcal{L}_1$, in which we describe 
multi-valued data, and the two languages of classes $\mathcal{L}_2$ and $\mathcal{L}_3$ that we use to 
describe classes over those data, at two levels of abstraction. First, we provide 
some notations and preliminaries. 

2.1 Preliminaries and Notations 

Given a language of instances, a language of classes $\mathcal{L}$ defines the expressions 
that are allowed as class descriptions. A class description is intended to represent 
in an abstract and concise way the properties that are common to the set of its 
instances. A membership relation, denoted by $isa_{\mathcal{L}}$, establishes the necessary 
connection between a given language of instances and an associated language of 
classes $\mathcal{L}$. 

Definition 1 (Extension of a class description). Let $I$ be a set of instances, 
and $C$ an $\mathcal{L}$ class description. The extension of $C$ w.r.t. $I$ is the following set: 

$$ext_I(C) = \{\, i \in I \mid i \; isa_{\mathcal{L}} \; C \,\}$$

The subsumption relation is a preorder relation between class descriptions, 
induced by the inclusion relation between class extensions. 

Definition 2 (Subsumption between classes). Let $C_1$ and $C_2$ be two $\mathcal{L}$ 
class descriptions. $C_1$ is subsumed by $C_2$, denoted $C_1 \preceq_{\mathcal{L}} C_2$, iff for every set $I$ 
of instances, $ext_I(C_1) \subseteq ext_I(C_2)$. 

In Sects. 2.3 and 2.4 we will provide a constructive characterization of 
subsumption for the two languages of classes that we consider. 

The notion of abstraction of an instance in a language of classes $\mathcal{L}$ 
corresponds, when it exists, to the most specific class description in $\mathcal{L}$ which it is an 
instance of. 

Definition 3 (Abstraction of an instance). Let $i$ be an instance. The $\mathcal{L}$ class 
description $C$ is an abstraction of $i$ in $\mathcal{L}$ (for short $C = abs_{\mathcal{L}}(i)$) iff 

1. $i \; isa_{\mathcal{L}} \; C$, and 

2. if $D$ is a class description such that $i \; isa_{\mathcal{L}} \; D$, then $C \preceq_{\mathcal{L}} D$. 

The notion of least common subsumer will be the basis for gathering classes 
in our clustering algorithm. 

Definition 4 (Least Common Subsumer). 

Let $C_1, \ldots, C_n$ be class descriptions in $\mathcal{L}$. The $\mathcal{L}$ class description $C$ is a 
least common subsumer of $C_1, \ldots, C_n$ in $\mathcal{L}$ (for short $C = lcs_{\mathcal{L}}(C_1, \ldots, C_n)$) iff 

1. $C_i \preceq_{\mathcal{L}} C$ for all $1 \le i \le n$, and 

2. if $D$ is a class description satisfying $C_i \preceq_{\mathcal{L}} D$ for all $1 \le i \le n$, then $C \preceq_{\mathcal{L}} D$. 

2.2 The Language of Instances 

The data that serve as instances of the classes that we build are typed (i.e., 
labelled by the name of a basic type) and described by a set of pairs (Attribute, 
Value). The attributes used for describing the data may vary from one item to 
another. In addition, an attribute can be multi-valued. 

Definition 5 (Terms of $\mathcal{L}_1$). Let $B$ be a finite set of basic types, $A$ a finite set 
of attributes, and $V$ a set of values. A term of $\mathcal{L}_1$ is of the form: 

$$\{ c,\; att_1 = V_1, \ldots, att_n = V_n \}$$

where $c \in B$, $\forall i \in [1..n]$, $att_i \in A$ and $V_i \subseteq V$. 

The description of an instance is a term of $\mathcal{L}_1$. For example, we can find a 
product in the C/Net catalog, whose $\mathcal{L}_1$ description is: 

{RemovableDiskDrive, CD/DVD/Type={CDRW}, 
StorageRemovableType={SuperDisk}, Compatibility={MAC, PC}} 

In the following, we will consider that the type $c$ of an $\mathcal{L}_1$ description is a 
boolean attribute. 



2.3 The Language of Classes $\mathcal{L}_2$ 

Definition 6 (Class description in $\mathcal{L}_2$). An $\mathcal{L}_2$ class description (of size $n$) 
is a tuple of attributes $\{att_1, \ldots, att_n\}$, where $\forall i \in [1..n]$, $att_i \in A$. 

The connection between the language of instances $\mathcal{L}_1$ and the language of 
classes $\mathcal{L}_2$ is based on the following definition of the membership relation. 

Definition 7 (Membership relation for $\mathcal{L}_2$). Let $i$ be an instance description 
in $\mathcal{L}_1$. Let $C$ be an $\mathcal{L}_2$ class description: $i$ is an instance of $C$ iff every attribute 
appearing in $C$ also appears in $i$. 

The following proposition, whose proof is straightforward, characterizes 
subsumption, least common subsumer and abstraction in $\mathcal{L}_2$. 

Proposition 1 (Properties of $\mathcal{L}_2$). 

• Let $C_1$ and $C_2$ be two $\mathcal{L}_2$ class descriptions. $C_1 \preceq_{\mathcal{L}_2} C_2$ iff every attribute 
of $C_2$ is also an attribute of $C_1$. 

• Let $\{att_1 = V_1, \ldots, att_n = V_n\}$ be an instance description in $\mathcal{L}_1$. Its 
abstraction in $\mathcal{L}_2$ is unique: it is $\{att_1, \ldots, att_n\}$. 

• Let $C_1, \ldots, C_n$ be $n$ $\mathcal{L}_2$ class descriptions. Their least common subsumer 
is unique: it is made of the set of attributes that are common to all the $C_i$'s. 
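
The properties stated in Proposition 1 translate almost directly into set operations. The Python sketch below is only illustrative: it assumes an instance is a dictionary mapping attribute names to sets of values and a class description of the first language of classes is a frozenset of attribute names; all function names are hypothetical.

```python
from functools import reduce


def abstraction_l2(instance: dict) -> frozenset:
    """Abstraction of an instance: the set of its attribute names."""
    return frozenset(instance)


def is_instance_l2(instance: dict, cls: frozenset) -> bool:
    """Membership: every attribute of the class appears in the instance."""
    return cls <= frozenset(instance)


def subsumed_by_l2(c1: frozenset, c2: frozenset) -> bool:
    """c1 is subsumed by c2 iff every attribute of c2 is an attribute of c1."""
    return c2 <= c1


def lcs_l2(*classes: frozenset) -> frozenset:
    """Least common subsumer of one or more descriptions: shared attributes."""
    return reduce(frozenset.intersection, classes)


if __name__ == "__main__":
    product = {
        "RemovableDiskDrive": {True},
        "CD/DVD/Type": {"CDRW"},
        "StorageRemovableType": {"SuperDisk"},
        "Compatibility": {"MAC", "PC"},
    }
    other = frozenset({"Compatibility", "Price"})  # hypothetical description
    print(lcs_l2(abstraction_l2(product), other))
```

Note that a description with fewer attributes is more general, which is why subsumption is the reverse of attribute-set inclusion.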
2.4 The Language of Classes $\mathcal{L}_3$ 

$\mathcal{L}_3$ is richer than $\mathcal{L}_2$ in several respects: it makes it possible to restrict the possible 
values of an attribute; it also makes it possible to distinguish the number of values of an 
attribute through different suffixes (*, +, ?, e) whose notation is inspired by the 
one used in XML for describing document type definitions (DTDs), and whose 
formal semantics corresponds to standard description logics constructors. 

Definition 8 (Class description in $\mathcal{L}_3$). An $\mathcal{L}_3$ class description (of size $n$) 
is a tuple 

$$\{ att_1^{suff_1} : V_1, \ldots, att_n^{suff_n} : V_n \}$$

where $\forall i \in [1..n]$, $att_i \in A$, $V_i \subseteq V$, and $suff_i \in \{*, +, ?, e\}$. 

The following definition formalizes the membership relation between an 
instance and a class description in $\mathcal{L}_3$. 

Definition 9 (Membership relation for $\mathcal{L}_3$). Let $i$ be an instance description 
in $\mathcal{L}_1$. Let $C$ be an $\mathcal{L}_3$ class description. $i$ is an instance of $C$ iff every attribute 
in $i$ appears in $C$ and for every term $att^{suff} : V$ appearing in $C$, 

- when $suff = *$, if there exists $V'$ s.t. $att = V' \in i$, then $V' \subseteq V$, 

- when $suff = +$, there exists $V' \subseteq V$ s.t. $att = V' \in i$, 

- when $suff = ?$, if there exists $V'$ s.t. $att = V' \in i$, then $V'$ is a singleton and 
$V' \subseteq V$, 

- when $suff = e$, there exists a singleton $V'$ s.t. $V' \subseteq V$ and $att = V' \in i$. 

The product described in Sect. 2.2 is an instance of the $\mathcal{L}_3$ class description $C_1$: 

{RemovableDiskDrive^e : {true}, CD/DVD/ReadSpeed^? : {20x, 32x, 24x}, 
CD/DVD/Type^e : {CDROM, CDRW}, Compatibility^+ : {MAC, PC}, 
StorageRemovableType^e : {SuperDisk, ZIP, JAZ}} 

It represents the set of products that have in their description (i) necessarily the 
monovalued and boolean attribute RemovableDiskDrive whose value must be 
true, (ii) possibly the attribute CD/DVD/ReadSpeed, and if that is the case, 
this attribute is monovalued and its value belongs to the set {20x, 32x, 24x}, 
(iii) necessarily the attribute CD/DVD/Type, which is monovalued and takes 
its value in the set {CDROM, CDRW}, (iv) necessarily the attribute 
Compatibility, which can be multivalued and takes its value(s) in the set {MAC, PC}, 
(v) necessarily the attribute StorageRemovableType, which is monovalued and 
takes its value in the set {SuperDisk, ZIP, JAZ}. 
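
The membership test of Definition 9 can be sketched in Python as below. This is a hedged illustration, not the authors' implementation: the data layout (a class description as a dictionary from attribute name to a pair of suffix and allowed values) and the function name are assumptions, and the suffixes given to the example class follow the reading (i) to (v) of the text above.

```python
def is_instance_l3(instance: dict, cls: dict) -> bool:
    """instance: attribute -> set of values.
    cls: attribute -> (suffix, allowed values), suffix in {'*', '+', '?', 'e'}.
    """
    # Every attribute of the instance must be mentioned by the class.
    if any(att not in cls for att in instance):
        return False
    for att, (suffix, allowed) in cls.items():
        values = instance.get(att)
        if values is None:
            # Only the optional suffixes '*' and '?' tolerate a missing attribute.
            if suffix in ("+", "e"):
                return False
            continue
        if not values <= allowed:
            return False
        # '?' and 'e' additionally require exactly one value.
        if suffix in ("?", "e") and len(values) != 1:
            return False
    return True


if __name__ == "__main__":
    c1 = {
        "RemovableDiskDrive": ("e", {"true"}),
        "CD/DVD/ReadSpeed": ("?", {"20x", "32x", "24x"}),
        "CD/DVD/Type": ("e", {"CDROM", "CDRW"}),
        "Compatibility": ("+", {"MAC", "PC"}),
        "StorageRemovableType": ("e", {"SuperDisk", "ZIP", "JAZ"}),
    }
    product = {
        "RemovableDiskDrive": {"true"},
        "CD/DVD/Type": {"CDRW"},
        "StorageRemovableType": {"SuperDisk"},
        "Compatibility": {"MAC", "PC"},
    }
    print(is_instance_l3(product, c1))
```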

The following propositions state the main properties of $\mathcal{L}_3$. Their proofs 
follow from results in tractable description logics where structural subsumption 
is complete. 

Proposition 2 (Characterization of subsumption in $\mathcal{L}_3$). Let $C_1$ and $C_2$ 
be two $\mathcal{L}_3$ class descriptions. $C_1 \preceq_{\mathcal{L}_3} C_2$ iff all the attributes appearing in $C_1$ 
appear also in $C_2$ and for every pair $att^{suff} : V$ appearing in $C_2$, 

- when $suff = *$, if there exists $att^{suff'} : V' \in C_1$, then $V' \subseteq V$, 

- when $suff = +$, there exists $V' \subseteq V$ s.t. $att^{+} : V' \in C_1$ or $att^{e} : V' \in C_1$, 

- when $suff = ?$, if there exists $att^{suff'} : V' \in C_1$, then $suff' = ?$ or $suff' = e$, 
and $V' \subseteq V$, 

- when $suff = e$, there exists $V'$ s.t. $V' \subseteq V$ and $att^{e} : V' \in C_1$. 



Proposition 3 (Characterization of abstraction in $\mathcal{L}_3$). Let $\{att_1 = V_1, \ldots, 
att_n = V_n\}$ be an instance description in $\mathcal{L}_1$. Its abstraction in $\mathcal{L}_3$ is unique: 
$abs_{\mathcal{L}_3} = \{att_1^{suff_1} : V_1, \ldots, att_n^{suff_n} : V_n\}$, where $\forall i \in [1..n]$, if $|V_i| \ge 2$ then 
$suff_i = +$ else $suff_i = e$. 
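
Proposition 3 amounts to a one-line transformation. The sketch below assumes, for illustration only, that an instance is a dictionary from attribute names to sets of values and that a class description maps each attribute to a pair of suffix and values; the function name is hypothetical.

```python
def abstraction_l3(instance: dict) -> dict:
    """Give each attribute the suffix '+' if it has two or more values, else 'e'."""
    return {
        att: ("+" if len(values) >= 2 else "e", frozenset(values))
        for att, values in instance.items()
    }


if __name__ == "__main__":
    product = {
        "RemovableDiskDrive": {"true"},
        "CD/DVD/Type": {"CDRW"},
        "Compatibility": {"MAC", "PC"},
    }
    for att, (suffix, values) in abstraction_l3(product).items():
        print(att, suffix, sorted(values))
```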



Proposition 4 (Characterization of lcs in $\mathcal{L}_3$). Let $C_1, \ldots, C_n$ be $n$ class 
descriptions. Let $A$ be the set of attributes belonging to at least one description 
$C_i$. $C_1, \ldots, C_n$ have a unique least common subsumer in $\mathcal{L}_3$, whose description 
is characterized as follows: 

- for every attribute $att \in A$, let $V$ be the union of the sets of values associated 
with $att$ in the class descriptions $C_i$'s: $V = \bigcup \{ V_i \mid att^{suff} : V_i \in C_i \}$. 

- $att^{e} : V \in lcs(C_1, \ldots, C_n)$ iff $att^{e} : V_i \in C_i$ $\forall i \in [1..n]$. 

- $att^{?} : V \in lcs(C_1, \ldots, C_n)$ iff 

($\forall i \in [1..n]$, $att^{*} : V_i \notin C_i$ and $att^{+} : V_i \notin C_i$), and 

• either $\exists i \in [1..n]$ s.t. $att^{?} : V_i \in C_i$, 

• or $\exists i \in [1..n]$ s.t. $att^{s} : V' \notin C_i$ for any suffix $s$. 

- $att^{*} : V \in lcs(C_1, \ldots, C_n)$ iff 

• either $\exists i \in [1..n]$ s.t. $att^{*} : V_i \in C_i$, 

• or $\exists i \in [1..n]$ s.t. $att^{+} : V_i \in C_i$, and $\exists j \in [1..n]$ s.t. $att^{?} : V_j \in C_j$ 
or $att^{s'} : V' \notin C_j$ for any suffix $s'$. 

- $att^{+} : V \in lcs(C_1, \ldots, C_n)$ iff 

$\exists i \in [1..n]$ s.t. $att^{+} : V_i \in C_i$ and $\forall j \in [1..n]$, $att^{+} : V_j \in C_j$ or 
$att^{e} : V_j \in C_j$. 

For example, if $C_2$ is the $\mathcal{L}_3$ description: 

{Compatibility^* : {PC, Unix}, StorageRemovableType^ : {DAT}, 
CompressedCapacity" : {8, 24, 32, 70}} 

then lcs($C_1$, $C_2$) = 

{RemovableDiskDrive^? : {true}, CD/DVD/ReadSpeed^? : {20x, 32x, 24x}, 
CD/DVD/Type^? : {CDROM, CDRW}, Compatibility^* : {MAC, PC, Unix}, 
StorageRemovableType" : {SuperDisk, ZIP, JAZ, DAT}, 
CompressedCapacity'' : {8, 24, 32, 70}} 

3 Construction of a Lattice of $\mathcal{L}_2$ Classes 

The goal is to structure a set of data described in $\mathcal{L}_1$ into clusters labelled by $\mathcal{L}_2$ 
descriptions, and organized in a lattice providing a browsable semantic interface 
facilitating the access to the data for end-users. We proceed in two steps: 

1. In the first step, the data are partitioned according to their type: for each 
type $c$, we create a basic class named $c$. Its set of instances, denoted $inst(c)$, is 
the set of data of type $c$. Its $\mathcal{L}_2$ description, $desc(c)$, is obtained by computing 
the least common subsumer of the abstractions of its instances. The result 
of this step is a set $\mathcal{C}$ of basic classes and a set $A$ of attributes supporting the 
$\mathcal{L}_2$ descriptions of the classes of $\mathcal{C}$. For each attribute $a$, the set $classes(a)$ 
of basic classes having $a$ in their description is computed. This preliminary 
clustering step has a linear data complexity. 

2. In the second step, a lattice of clusters is constructed by gathering basic 
classes according to similarities of their $\mathcal{L}_2$ descriptions. In this step, clusters 
are unions of basic classes. The computational complexity of this step does 
not depend on the number of initial data but only on the size of the $\mathcal{L}_2$ 
descriptions of basic classes. 

We now detail this second step. A cluster $c_{i_1} \ldots c_{i_j}$ will appear in the 
lattice if the $\mathcal{L}_2$ descriptions of the classes $c_{i_1} \ldots c_{i_j}$ are judged similar enough to 
gather their instances. The similarity between class descriptions is stated by 
attributes in common. However, we take into account only attributes that do not 
occur in too many classes. For instance, the attribute price may appear in all the 
instances of a catalog describing products, and is therefore not useful to 
discriminate product descriptions. Among the set $A$ of attributes, we select meaningful 
attributes as being the attributes $a \in A$ such that $\frac{|classes(a)|}{|\mathcal{C}|} < s$, where $s$ is a 
certain threshold (e.g., $s = 0.8$). Let $A_0$ be the set of meaningful attributes. We 
redescribe all the basic classes in terms of the attributes of $A_0$ only: for a basic 
class $c$, we call its short description, denoted $shortdesc(c)$, the $\mathcal{L}_2$ description 
of $c$ restricted to the meaningful attributes: $shortdesc(c) = desc(c) \cap A_0$. 
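
The selection of meaningful attributes and the resulting short descriptions can be sketched as follows. This is a minimal Python illustration under stated assumptions: the selection criterion is read as "fraction of basic classes containing the attribute is below s", descriptions are plain sets of attribute names, and the function names are hypothetical.

```python
def meaningful_attributes(desc: dict, s: float = 0.8) -> set:
    """desc: basic class name -> set of attributes in its description."""
    n_classes = len(desc)
    counts = {}
    for attrs in desc.values():
        for a in attrs:
            counts[a] = counts.get(a, 0) + 1
    # Keep attributes that occur in a fraction of classes below the threshold.
    return {a for a, k in counts.items() if k / n_classes < s}


def short_descriptions(desc: dict, s: float = 0.8) -> dict:
    """Restrict every description to the meaningful attributes."""
    a0 = meaningful_attributes(desc, s)
    return {c: set(attrs) & a0 for c, attrs in desc.items()}


if __name__ == "__main__":
    # Hypothetical catalog: 'price' occurs in every class and is dropped.
    desc = {
        "Monitor": {"price", "ScreenSize"},
        "Laptop": {"price", "ScreenSize", "Weight"},
        "Printer": {"price", "Weight"},
    }
    print(short_descriptions(desc))
```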

Our clustering algorithm, $\mathcal{L}_2$-Cluster, is described in Algorithm 1. It is 
adapted from a frequent item set algorithm ([2]). It iteratively builds levels of 
clusters, starting with the level of the coarsest clusters, which correspond 
to unions of basic classes having at least one attribute in common. Each iteration 
$k$ is guided by attribute sets of increasing size $k$ which, being common to some 
class descriptions, are the support of the creation of a potential node gathering 
those classes. Among those potential nodes, we effectively add to the lattice 
those whose $\mathcal{L}_2$ short description is equal to their $k$-support: the $k$-support of a 
node generated at iteration $k$ is the $k$-itemset supporting the generation of that 
node. By doing so, we guarantee that the description of the nodes added to the 
lattice is strictly subsumed by those of their fathers. 

Notation: We call a $k$-itemset a set of attributes of size $k$. We assume that 
attributes in itemsets are kept sorted in their lexicographic order. We use the 
notation $p[i]$ to represent the $i$-th attribute of the $k$-itemset $p$ consisting of the 
attributes $p[1], \ldots, p[k]$ where $p[1] < \ldots < p[k]$. 

Figure 1 shows the lattice returned by $\mathcal{L}_2$-Cluster when it is applied on $\mathcal{C} = 
\{c_1, c_2, c_3, c_4, c_5\}$ and $A = \{a_1, a_2, a_3, a_4\}$ such that: $shortdesc(c_1) = \{a_1, a_2, a_3\}$, 
$shortdesc(c_2) = \{a_2\}$, $shortdesc(c_3) = \{a_1, a_3\}$, $shortdesc(c_4) = \{a_3, a_4\}$, 
$shortdesc(c_5) = \{a_1, a_3\}$. 
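
To make the five-class example concrete, the following Python sketch enumerates the closed (attribute set, class set) pairs by brute force. It is not the levelwise algorithm of the paper: it simply intersects the short descriptions of every group of classes, which is exponential in the number of classes and only meant for tiny inputs. The function name and the data layout are assumptions.

```python
from itertools import combinations


def closed_nodes(shortdesc: dict) -> dict:
    """Map each non-empty closed attribute set to the basic classes it gathers."""
    names = list(shortdesc)
    intents = set()
    # Every non-empty intersection of short descriptions is a closed attribute set.
    for r in range(1, len(names) + 1):
        for group in combinations(names, r):
            common = frozenset.intersection(
                *(frozenset(shortdesc[c]) for c in group))
            if common:
                intents.add(common)
    # A node gathers every class whose short description contains the set.
    return {
        intent: frozenset(c for c in names if intent <= frozenset(shortdesc[c]))
        for intent in intents
    }


if __name__ == "__main__":
    shortdesc = {
        "c1": {"a1", "a2", "a3"},
        "c2": {"a2"},
        "c3": {"a1", "a3"},
        "c4": {"a3", "a4"},
        "c5": {"a1", "a3"},
    }
    for intent, extent in sorted(closed_nodes(shortdesc).items(),
                                 key=lambda kv: -len(kv[1])):
        print(sorted(extent), sorted(intent))
```

On this input the sketch yields five nodes, among them the clusters c1 c3 c5 (described by a1, a3) and c1 (described by a1, a2, a3). A node for the attribute a1 alone does not appear, because every class containing a1 also contains a3.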
Algorithm 1. $\mathcal{L}_2$-Cluster 

Require: a set $A_0$ of meaningful attributes: for each $a \in A_0$, $classes(a)$ is the set 
of basic classes of $\mathcal{C}$ whose $\mathcal{L}_2$ short description contains $a$. 

Ensure: return a lattice organized in levels of nodes. Each node $n$ is characterized 
by $classes(n)$: the basic classes it gathers, and $shortdesc(n)$: the least common 
subsumer of the short descriptions of the basic classes of the cluster. 

1: (* Initialization step gathering the biggest unions of classes having at least 
one attribute in common: *) 
2: $A_1 \leftarrow A_0$; $level(1), \ldots, level(|\mathcal{C}|) \leftarrow \emptyset$ 
3: for every $a \in A_1$ do 
4:     let $classes(a) = \{c_1, \ldots, c_j\}$ 
5:     let $desc = lcs_{\mathcal{L}_2}(desc(c_1), \ldots, desc(c_j))$ 
6:     if $desc \cap A_0 = \{a\}$ then 
7:         add to $level(j)$ a node $n$ such that: 
               $classes(n) = \{c_1, \ldots, c_j\}$; 
               $shortdesc(n) = desc \cap A_0$; 
               $node(\{a\}) = n$ 
8: $k \leftarrow 1$ 
9: (* Generation of new nodes supported by $(k+1)$-itemsets: *) 
10: repeat 
11:     for every pair $(p, q) \in A_k$ do 
12:         if $p[1] = q[1], \ldots, p[k-1] = q[k-1], p[k] < q[k]$ then 
13:             let $newp = p \cup \{q[k]\}$, and let $S_k$ be the set of $k$-subsets of $newp$. 
14:             if $S_k \subseteq A_k$ and $classes(node(p)) \cap classes(q[k]) \neq \emptyset$ then 
15:                 add $newp$ to $A_{k+1}$ 
16:                 let $\{c_{i_1}, \ldots, c_{i_j}\}$ be $classes(node(p)) \cap classes(q[k])$ 
17:                 let $desc = lcs_{\mathcal{L}_2}(desc(c_{i_1}), \ldots, desc(c_{i_j}))$ 
18:                 if $desc = newp$ then 
19:                     add to $level(j)$ a node $n$ such that: 
                           $classes(n) = \{c_{i_1}, \ldots, c_{i_j}\}$; 
                           $shortdesc(n) = desc$; 
                           $node(newp) = n$ 
20:     $k \leftarrow k + 1$ 
21: until $A_k = \emptyset$ 
22: (* Creation of the lattice. For every node $n$, $Fathers(n)$ groups the fathers 
of $n$ among the nodes of greater levels: *) 
23: Initialize $Fathers(n)$ and $AncNotFathers(n)$ to $\emptyset$ for every generated node $n$. 
24: for $i = |\mathcal{C}| - 1$ downto 1 do 
25:     for every node $n \in level(i)$ do 
26:         for $j = i + 1$ to $|\mathcal{C}|$ do 
27:             for every node $m \in level(j)$ do 
28:                 if $classes(n) \subset classes(m)$ and $m \notin AncNotFathers(n)$ then 
29:                     add $m$ to $Fathers(n)$ 
30:                     add $Fathers(m) \cup AncNotFathers(m)$ to $AncNotFathers(n)$ 

Fig. 1. Example of a lattice constructed by $\mathcal{L}_2$-Cluster 



The following proposition summarizes the properties of the algorithm $\mathcal{L}_2$-Cluster. 

Proposition 5 (Properties of $\mathcal{L}_2$-Cluster). Let $\mathcal{H}$ be the lattice returned by 
$\mathcal{L}_2$-Cluster. 

- For each node $n \in \mathcal{H}$, let $shortdesc(n)$ and $classes(n)$ be respectively the 
description and the set of basic classes returned by $\mathcal{L}_2$-Cluster: 

$$shortdesc(n) = lcs_{\mathcal{L}_2}(shortdesc(abs_{\mathcal{L}_2}(i_1)), \ldots, shortdesc(abs_{\mathcal{L}_2}(i_k)))$$

where $\{i_1, \ldots, i_k\} = \bigcup_{c \in classes(n)} inst(c)$. 

- $\mathcal{H}$ is a Galois lattice, i.e. for every node $n$, the pair $(classes(n), 
shortdesc(n))$ is maximal in the following sense: there is no $m \in \mathcal{H}$ such that 
$classes(n) \subset classes(m)$ and $shortdesc(n) = shortdesc(m)$, or $shortdesc(n) 
\prec shortdesc(m)$ and $classes(n) = classes(m)$. 

4 Refinement in $\mathcal{L}_3$ 

The goal of this step is to refine a part of the lattice $\mathcal{H}$ computed by $\mathcal{L}_2$-Cluster 
based on the more expressive language $\mathcal{L}_3$. This step is carried out after a user 
chooses one node Fatn and one of its descendants Sonn in $\mathcal{H}$. Algorithm 2 
describes how new nodes are possibly added between Sonn and Fatn. Those 
new nodes correspond to clusters whose descriptions in $\mathcal{L}_2$ could not be distinguished 
from those of Fatn or Sonn, while having distinct descriptions in $\mathcal{L}_3$. A closure 
operation on those nodes is necessary in order to make their $\mathcal{L}_3$ descriptions 
maximal w.r.t. the union of basic classes which they gather. $\mathcal{L}_3$-Cluster applies 
after the descriptions in $\mathcal{L}_3$ (denoted desc3 in Algorithm 2) have been computed 
for Sonn and Fatn. Those computations are least common subsumer calculations 
whose overall time cost is polynomial w.r.t. the size and the number of the 
instances of the basic classes involved in Fatn. 

Let us illustrate the application of $\mathcal{L}_3$-Cluster on the nodes c1 c3 c5 and c1 
of the lattice of Fig. 1, assuming that the $\mathcal{L}_3$ descriptions of the involved basic 
classes are: 
Algorithm 2. $\mathcal{L}_3$-Cluster 

Require: Two nodes Fatn and Sonn such that $classes(Sonn) \subset classes(Fatn)$. 
Ensure: return a lattice between Fatn and Sonn 
1: L-Cl $\leftarrow \{classes(Fatn), classes(Sonn)\}$ 
2: LRes-Cl $\leftarrow$ L-Cl ; Nodes $\leftarrow \{Sonn\}$ 
3: for every node $n \in$ Nodes do 
4:     for every class $c \in classes(Fatn) \setminus classes(n)$ do 
5:         Change $\leftarrow$ false ; classes $\leftarrow classes(n) \cup \{c\}$ 
6:         if classes $\notin$ L-Cl then 
7:             L-Cl $\leftarrow$ L-Cl $\cup\ \{$classes$\}$ 
8:             desc3 $\leftarrow lcs_{\mathcal{L}_3}(desc3(n), desc3(c))$ 
9:             (* Closure operation: *) 
10:            for every class $Cl \in classes(Fatn) \setminus$ classes do 
11:                if $desc3(Cl) \preceq$ desc3 then 
12:                    add $Cl$ to classes ; Change $\leftarrow$ true 
13:            if classes $\notin$ LRes-Cl then 
14:                add a node $p$ to Nodes such that $classes(p) =$ classes and 
                   $desc3(p) =$ desc3 
15:                LRes-Cl $\leftarrow$ LRes-Cl $\cup\ \{classes(p)\}$ 
16:            if Change = true then 
17:                L-Cl $\leftarrow$ L-Cl $\cup\ classes(p)$ 
18:     Suppress $n$ from Nodes 



desc{ci)={att^ : {vl,v3}, att^ ■ {v2,w4},att| : {f6}} desc{c 3 )={att\ : {w3}, 
attl : : {f7}} desc{c^)={att\ : {w5},att| : {n7,t;8}}. 

LRes-Cl and L-Cl are initialized to {{c1, c3, c5}, {c1}}. 

• Gathering c3 with c1 is considered first: 

desc3={attf : {vl,v3},att2 : {v2,v4},attl : {v6,v7}}, classes={cl,c3} 

Since desc3(c5) is not subsumed by desc3, the new node c1 c3 is added. 

LRes-Cl is updated to {{c1,c3,c5}, {c1}, {c1,c3}}. 

• Gathering c5 with c1 is now considered: 

desc3={attf : {wl, u3, c5}, • {c2,c4},aft| : {c6,t;7,c8}}, classes={cl,c5} 

Since desc3(c3) is subsumed by desc3, c3 is added to classes, which is updated 
to {c1,c3,c5}. The node corresponding to c1 c5 is not added since it is not 
closed; the node corresponding to its closure c1 c3 c5 is not added either because 
{c1, c3, c5} is already in LRes-Cl. 

The following proposition summarizes the main properties of $\mathcal{L}_3$-Cluster. 

Proposition 6 (Properties of $\mathcal{L}_3$-Cluster). Let $\mathcal{C}_1$ be the set of basic classes 
of the father node. Let $\mathcal{C}_2$ ($\mathcal{C}_2 \subset \mathcal{C}_1$) be the set of basic classes of the son node. 
Let $\mathcal{H}_3$ be the lattice returned by $\mathcal{L}_3$-Cluster. 

- For each node $n \in \mathcal{H}_3$, let $desc3(n)$ and $classes(n)$ be respectively the 
description and the set of basic classes returned by $\mathcal{L}_3$-Cluster: 

$$desc3(n) = lcs_{\mathcal{L}_3}(abs_{\mathcal{L}_3}(i_1), \ldots, abs_{\mathcal{L}_3}(i_k))$$

where $\{i_1, \ldots, i_k\} = \bigcup_{c \in classes(n)} inst(c)$. 

- $\mathcal{H}_3$ is a complete Galois lattice, i.e. for every node $n$, the pair $(classes(n), 
desc3(n))$ is maximal, and $\mathcal{H}_3$ contains every node satisfying the maximality 
criterion whose set of classes includes $\mathcal{C}_2$ and is included in $\mathcal{C}_1$. 



5 Complexity and Experimental Results 

The following results come directly from known results in description logics and 
in Galois lattices. Since $\mathcal{L}_3$ is a subset of the C-CLASSIC description logic P, 
the complexity of checking subsumption in $\mathcal{L}_3$ is quadratic w.r.t. the maximal 
size of class descriptions, and computing the lcs of $\mathcal{L}_3$ descriptions is linear in the 
number of descriptions and quadratic in their size. The worst-case time complexity of 
$\mathcal{L}_2$-Cluster is exponential in the maximal size of the $\mathcal{L}_2$ descriptions of the basic classes. 
The worst-case time complexity of $\mathcal{L}_3$-Cluster is exponential w.r.t. 
$|classes(Fatn)| - |classes(Sonn)|$. 

We have evaluated our approach using a real dataset composed of 2274 
computer products extracted from the C/Net catalog. Each product is described 
using a subset of 234 attributes, possibly multi-valued. There are 59 types of 
products and each product is labelled by one and only one type. The goal of 
the experiment was twofold: to assess the efficiency and the simplicity of the 
lattice for the first clustering step ($\mathcal{L}_2$-Cluster) and to show the accuracy of the 
refinement of a part of the lattice using the second clustering step ($\mathcal{L}_3$-Cluster). 

In order to make the £2 lattice even simpler, the number of nodes obtained 
with £2-Gluster may be parametrized by a threshold n used to restrict the nodes 
that appear in the lattice to gather at least n basic classes (ISI)- Figure 0 
shows that, as it is mentionned in this quantitative criteria allows us to 
significantly decrease the size of the lattice. Figure 0 illustrates the simplicity of 
the £2 descriptions and the significance of the nodes. 

L3-Cluster makes it possible to distinguish nodes that cannot be distinguished by
L2-Cluster. For instance, if L3-Cluster is applied when an end-user chooses to refine
the L2 lattice between the node (a) and the node (b) in Fig. 3, the aggregation
of all types of drives (i.e. RemovableTapeDrive, RemovableDiskDrive
and HardDiskDrive) is part of the L3 lattice. This new cluster appears for the
following reasons: no drive is described using the attributes Stor.Controller /
RAID Level, Networking / DataLinkProtocol or Networking / Type (those attributes
were optional in the L3 description of (a)). In addition, the value SCSI for
the attribute Stor.Controller / Type is not a possible value for a drive.

Fig. 2. Quantitative results of L2-Cluster

Fig. 3. A part of the L2 lattice for C/Net (n=3)



6 Conclusion and Discussion 

This paper has proposed an approach to organize large sets of semistructured
data into clusters. The approach scales up because its
complexity is kept under control in two ways: (1) the data are aggregated
into basic classes and the clustering is applied to the set of those basic classes
instead of the data set itself; (2) the two-step clustering method first
builds a coarse hierarchy, based on a simple language for describing the clusters,
and then uses a more elaborate language for refining a small subpart of the hierarchy
delimited by two nodes.

Related Work: Our work can be compared with existing work in machine 
learning based on more expressive languages than propositional language and/or 
using a shift of representation. Most work on expressive languages has been developed
in a supervised setting (e.g. Inductive Logic Programming), while little
work exists in an unsupervised setting. We can cite KBG 0, TIC jSj and mi 
which perform clustering in a relational setting. The main difference with our 
approach is that they use a distance as a numerical estimation of similarity. 
Although the best representation of a cluster is the least common subsumer
of its instances, they approximate it numerically by the cluster centroid (i.e.,
the point that minimizes the sum of squared distances). The reason is that,
in contrast with our setting where the lcs computation in L3 is polynomial,
lcs computation in their first-order language may be exponential. KLUSTER cni
refines a basic taxonomy of concepts in the setting of a description logic for
which computing the lcs is polynomial. In KLUSTER, the clusters are not unions
but subconcepts of primitive concepts, and the refinement aims at learning
discriminating definitions of mutually disjoint subconcepts of the same concept. As
for the use of a shift of representation, it is used in supervised learning in or- 
der to increase accuracy (i.e. the proportion of correctly predicted concepts in a 
set of test examples) psil4j or to search efficiently a reduced space of concepts 
m- In unsupervised learning, shift of representational bias may be used to 
change the point of view about the data. For instance, Cluster/2 f2| provides 
a user with a set of parameters about his preferences on the concepts to be 
created. Finally, the two-step clustering approach proposed in is similar in 
spirit to our clustering in L2 since it first identifies basic clusters (as high
density clusters) before building more general clusters that are unions of those 
basic clusters. 

Perspectives: We plan to extend our current work to take nested attributes
and textual values into account in L3 in order to fully deal with XML data.
Another relevant perspective of this work is the discovery of associations |3 . 
Abstract. Unsupervised clustering algorithms aim to synthesize
a dataset such that similar objects are grouped together whereas 
dissimilar ones are separated. In the context of data analysis, it is 
often interesting to have tools for interpreting the result. There are 
some criteria for symbolic attributes which are based on the frequency 
estimation of the attribute-value pairs. Our point of view is to integrate 
the construction of the interpretation inside the clustering process. To 
do this, we propose an algorithm which provides two partitions, one on 
the set of objects and the second on the set of attribute-value pairs such 
that those two partitions are the most associated ones. In this article, 
we present a study of several functions for evaluating the intensity of 
this association. 

Keywords. Unsupervised clustering, conceptual clustering, asso- 
ciation measures. 



1 Introduction 

One of the main data mining processes consists in reducing the dimension of a
dataset to increase knowledge and understanding. When no prior information
is available, unsupervised clustering can be used to discover the underlying
structure of the data. Indeed, those algorithms aim to build a partition on the
objects such that the most similar objects belong to the same cluster, and the
most dissimilar belong to different clusters. Hence, those procedures synthesize
the data into a few clusters. There is however no consensus on the algorithm to
use, because there are many ways to evaluate the proximity between objects and
the quality of a partition. Furthermore, the cardinality of the set of all possible
partitions increases exponentially with the size n of the set of objects, which
leads to the use of fast but often rough approximate optimization procedures.

Among the algorithms, we can distinguish two main families. The first one
gathers numerical algorithms. They can be characterized by the computation of
a distance between pairs of objects. This synthesizes all the dimensions of the
problem into a single one. The distance is used to construct the partition. For
example, in the K-MEANS algorithm [JD88], the distance is the Euclidean one
computed between the descriptive vectors of the objects embedded in the metric
space defined by the attributes. The objective function is equal to the sum over
all the clusters of the intra-class variance. Unfortunately, this function favors
over-cut partitions and it is necessary to fix the number K of clusters before
using it. In the case of the EM algorithm, the distance is evaluated by
a multivariate Gaussian density. At each step, the memberships of the objects
to the different clusters are evaluated, as well as the parameters of the Gaussian
densities associated with each cluster. But this algorithm still depends on an
a priori number of clusters.

Whereas statistical clustering methods are often constructed to process datasets
described by continuous features, conceptual clustering methods mainly focus
on symbolic features [TB01, BWL95]. They aim to provide a better integration
between the clustering and the interpretation stages of the data analysis process.
Each feature is an attribute of discrete type having several different values.
Attribute-value pairs are used in the construction of the clustering. Those algorithms
build a hierarchy of concepts using a probabilistic representation based on
a vector of conditional probabilities of occurrence of the attribute-value
pairs in the clusters. In COBWEB [Fis87, Fis96], the objective function
measures the average increase in the prediction of attribute-value pairs knowing
the partition. The optimization procedure is incremental but order dependent.

Sharing the same aims as conceptual clustering (dealing with symbolic features
and combining an interpretation with the obtained partition), we focus in
this article on a non-hierarchical approach to the problem. This type of method
consists in the construction of two linked partitions, called a bi-partition, one
on the set of objects and the second one on the whole set of attribute-value
pairs. The interest of such a method is to discover the underlying structure of
the data from the point of view of the objects as well as the descriptors. Thus, we
search for a bi-partition such that a unique cluster of objects fits a unique cluster
of attribute-value pairs and conversely. Consequently, each partition is an
interpretation of the other one, which makes the results easier to understand. Such
methods have already been built. The simultaneous clustering algorithm
is an adaptation of the nuées dynamiques of Diday [GDG+88] for symbolic data.
It consists in searching for a bi-partition with the partition of the set of objects in
a priori K clusters and the partition of the attribute-value pairs in L clusters.
An ideal binary table of dimensions K x L is constructed, such that the gap
between the initial data table structured by the two partitions and the ideal table
is minimized. This method needs the numbers of clusters (K, L) of
the bi-partition to be fixed a priori, and the iterative procedure leads to a local optimum. To avoid
those drawbacks, we propose a new algorithm based on the optimization of an
objective function which does not need the number of clusters to be fixed a priori.
We focus in this article on the construction of such a function.



Comparison of Three Objective Functions for Conceptual Clustering 401 



The contributions of this paper are, first an algorithm without any pre- 
scribed number of clusters, second a modification of association measures on 
co-occurrence tables to increase their discrimination power, and third an empir- 
ical study showing the relevance of the approach. 



2 An Algorithm for the Construction of a Bi-partition 



A basic clustering algorithm consists in optimizing a function which rewards
partitions with interesting properties. To define our algorithm, we need to construct
a function for evaluating the quality of a bi-partition. This function must favor
bi-partitions which satisfy the following property.

Property. The functional link, which allows one partition to be recovered from
the knowledge of the other one, must be as strong as possible. Furthermore,
both partitions of the bi-partition must have the same number of clusters.



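We denote by $f$ such a function over $\mathcal{P}_O \times \mathcal{P}_Q$, where $\mathcal{P}_O$ is the set of
partitions on the set of objects, and $\mathcal{P}_Q$ is the set of partitions on the set of
attribute-value pairs: $f : \mathcal{P}_O \times \mathcal{P}_Q \to \mathbb{R}$. Let us denote by $P$ an element of $\mathcal{P}_O$
and $Q$ one of $\mathcal{P}_Q$.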

This function must satisfy some properties. Such properties have been defined
in supervised clustering, where the function measures the agreement between
two partitions of the same set: the one given by the class variable and the
one constructed by the supervised clustering method. These properties are the
following:

— The function is maximal when to each cluster of P (resp. Q) is associated
one and only one cluster of Q (resp. P).

— When every cluster of P can be associated with each cluster of Q indiscriminately,
the objective function must be minimal.

— The function must be invariant under permutation of the clusters of P and
under permutation of the clusters of Q.

— The function must be able to compare two bi-partitions with different numbers
of clusters.



Nevertheless, we must add two new properties, because in our
problem neither of the two partitions constituting a bi-partition is fixed a priori.
The function must also satisfy the following two properties when it is maximal:

— Each object of a cluster of P owns all the attribute-value pairs belonging to 
its associated cluster of Q. 

— Each attribute-value pair of a cluster of Q is owned by all the objects of its 
associated cluster of P. 

Having defined the function $f$ to evaluate the quality of a bi-partition, the
clustering algorithm we propose is based on a gradient-like optimization. We
thus propose the following algorithm.

Let $(P_0, Q_0)$ be a randomized initial bi-partition
Repeat
    $Q_i$ is fixed, we search $P_{i+1} = \min_{P_{i+1} \in \mathcal{P}_O} f(P_{i+1}, Q_i)$
    $P_{i+1}$ is fixed, we search $Q_{i+1} = \min_{Q_{i+1} \in \mathcal{P}_Q} f(P_{i+1}, Q_{i+1})$
Until a convergence criterion is met

To modify a given bi-partition $(P_i, Q_i)$, several ways are possible: either
computing $(P_{i+1}, Q_{i+1})$ in one step, or computing $(P_{i+1}, Q_{i+1})$ in two steps as in
the proposed algorithm. We choose the latter method to obtain a more tractable
optimization problem; it is also a way to fix one partition as a reference, so that
the functional link is optimized with only one unknown.
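The two-step scheme can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: partitions are stored as lists of cluster labels, `n_labels` is only an upper bound on the number of clusters, the one-element-at-a-time greedy search `improve` is a stand-in for the unspecified inner optimization, and `agreement` is a toy objective (agreement between a binary data table and the block structure induced by the two partitions), not one of the functions studied in this paper. The pseudocode writes each step as a minimization, whereas the properties of $f$ describe the best bi-partition as the one where $f$ is maximal; the sketch maximizes a score, which amounts to minimizing its negative.

```python
import random
from typing import Callable, List, Sequence

Labels = List[int]
Table = Sequence[Sequence[int]]


def agreement(table: Table, p: Labels, q: Labels) -> int:
    """Toy objective (assumption): number of cells where the binary table
    equals 1 exactly when the row cluster and the column cluster match."""
    return sum(
        1
        for x, row in enumerate(table)
        for y, cell in enumerate(row)
        if cell == (1 if p[x] == q[y] else 0)
    )


def improve(labels: Labels, n_labels: int,
            score: Callable[[Labels], float]) -> Labels:
    """Greedy search: move one element at a time to the label that strictly
    increases the score, until no single move helps."""
    labels = list(labels)
    improved = True
    while improved:
        improved = False
        for idx in range(len(labels)):
            best_label, best_score = labels[idx], score(labels)
            for candidate in range(n_labels):
                labels[idx] = candidate
                s = score(labels)
                if s > best_score:
                    best_label, best_score = candidate, s
                    improved = True
            labels[idx] = best_label
    return labels


def alternate(table: Table, n_labels: int, f, seed: int = 0,
              max_rounds: int = 50):
    """Alternate between the two partitions, one being fixed at each step."""
    rng = random.Random(seed)
    p = [rng.randrange(n_labels) for _ in table]
    q = [rng.randrange(n_labels) for _ in table[0]]
    current = f(table, p, q)
    for _ in range(max_rounds):
        p = improve(p, n_labels, lambda cand: f(table, cand, q))  # Q fixed
        q = improve(q, n_labels, lambda cand: f(table, p, cand))  # P fixed
        new = f(table, p, q)
        if new == current:  # convergence criterion: no further gain
            break
        current = new
    return p, q, current


table = [[1, 1, 0, 0], [1, 1, 0, 0], [0, 0, 1, 1], [0, 0, 1, 1]]
p, q, score = alternate(table, n_labels=2, f=agreement)
```

As with any local search, the result depends on the random initial bi-partition and may be a local optimum only.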



3 The Kind of Functions to Use 

The previous properties are partially satisfied by association measures, which
have been built to evaluate the link between two attributes of discrete type.
Those coefficients are widely used in supervised clustering [LdC96], whereas few
unsupervised clustering algorithms use them. RIFFLE [MH91] uses Guttman's
$\lambda$ to measure the link between the partition (considered as an attribute) and
each original discrete attribute. Such association measures can be adapted to be
used as objective functions in the search for a bi-partition.

The search for criteria that are, on the one hand, sufficiently complex to discriminate
the different situations encountered and, on the other hand, sufficiently simple to
allow an intuitive interpretation has led to the creation of many measures. Those
measures have been constructed on contingency tables. After presenting some of
them, we show how we construct the co-occurrence table and how we modify the
measures in the clustering context.



3.1 Some Association Measures 

In the rest of the paper, $p_{ij}$ is an estimate of the probability that the value
$i$ of an attribute $X$ and the value $j$ of an attribute $Y$ arise simultaneously, $n$
is the cardinality of the set of objects, and $(p_{ij})$ defines the so-called contingency table,
with $p_{.j} = \sum_i p_{ij}$ and $p_{i.} = \sum_j p_{ij}$ the margins.

A first group of association measures gathers divergence measures between
probability distributions. Those coefficients evaluate the association between a
couple of attributes by measuring the gap between the current contingency table
constructed on the two attributes, and the one obtained in case of independence.
The situation of independence is easily characterized by $p_{ij} = p_{i.} \times p_{.j}$. A
well-known measure of divergence is the $\chi^2$:

$$\chi^2 = n \sum_i \sum_j \frac{(p_{ij} - p_{i.} p_{.j})^2}{p_{i.} p_{.j}}$$

This measure does not allow the comparison of contingency tables of different sizes (with
different numbers of rows and/or columns), which is why we prefer to use a normalized
version of this measure, the Tschuprow coefficient,

$$T = \sqrt{\frac{\chi^2}{n \sqrt{(p-1)(q-1)}}}$$

where $p$ and $q$ are the numbers of different values of each attribute.

A second class of measures gathers connection indices. Those coefficients evaluate
the gap with the situation of functional dependency characterized by a
function $f$ linking $Y$ to $X$, $Y = f(X)$.

Goodman and Kruskal mm built a family of such indices denoted by 
measure of proportional reduction in error (or PRE) . Those coefficients have an 
easy interpretation due to the fact that they evaluate a prediction rule in terms 
of probability of error. The construction of such measures requires the definition 
of three elements: 

— A prediction rule (C) of Y when X is known 

— A prediction rule (I) of Y when X is unknown 

— A measure of the error associated to the prediction rules 

The PRE measure is then equal to:

$$\mathrm{PRE} = \frac{\mathrm{error}(I) - \mathrm{error}(C)}{\mathrm{error}(I)}$$

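Guttman's $\lambda$ is a PRE measure. It consists in predicting the value of an
attribute by the most frequent one:

$$\lambda = \frac{(1 - \max_j p_{.j}) - \left(1 - \sum_i p_{i.} \max_j \frac{p_{ij}}{p_{i.}}\right)}{1 - \max_j p_{.j}}$$

Goodman and Kruskal proposed another more accurate coefficient called $\tau_b$.
Whereas $\lambda$ focuses only on the most frequent value, the $\tau_b$ measure takes into account
the whole structure of the distribution:

$$\tau_b = \frac{\sum_i \sum_j \frac{p_{ij}^2}{p_{i.}} - \sum_j p_{.j}^2}{1 - \sum_j p_{.j}^2}$$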

This way of defining connection indices can be generalized using the notion
of uncertainty. We call uncertainty measure a concave function $I(\cdot)$ on probability
distributions. For example, the Shannon entropy ($-\sum_i p_i \log p_i$) and the
quadratic entropy ($2[1 - \sum_i p_i^2]$) belong to this class.

The gain of uncertainty measures the reduction in error of the prediction
of $Y$ knowing $X$. It is equal to $\Delta I(Y|X) = I(P(Y)) - E_X[I(P(Y|X))]$.

We always have $\Delta I(Y|X) \geq 0$. Moreover, if $X$ and $Y$ are independent
then $\Delta I(Y|X) = 0$. The converse is true if $I$ is strictly concave. An index of
connection is then defined by $C(Y|X) = \frac{\Delta I(Y|X)}{I(P(Y))}$.

The $\tau_b$ coefficient is obtained when $I(P(Y)) = 1 - \sum_j p_{.j}^2$. When using the
Shannon entropy as $I(P(Y))$, we obtain the uncertainty coefficient:

$$U = \frac{\sum_i p_{i.} \log p_{i.} + \sum_j p_{.j} \log p_{.j} - \sum_i \sum_j p_{ij} \log p_{ij}}{\sum_j p_{.j} \log p_{.j}}$$
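The measures of this section can be computed directly from a contingency table. The following Python sketch is only an illustration under stated assumptions: the function names are ours, the table is given as a list of rows of counts that `normalize` turns into probabilities $p_{ij}$, the Tschuprow coefficient uses the standard definition $T = \sqrt{\chi^2 / (n\sqrt{(p-1)(q-1)})}$, $\lambda$ is written in the equivalent form that uses $\sum_i \max_j p_{ij}$, and degenerate tables (a single row or column, or an empty margin in a denominator) are not handled.

```python
import math
from typing import List, Sequence, Tuple

Table = Sequence[Sequence[float]]


def normalize(counts: Table) -> List[List[float]]:
    """Turn a table of counts into a table of probabilities p_ij."""
    total = sum(sum(row) for row in counts)
    return [[c / total for c in row] for row in counts]


def margins(p: Table) -> Tuple[List[float], List[float]]:
    """Return the row margins p_i. and the column margins p_.j."""
    return [sum(row) for row in p], [sum(col) for col in zip(*p)]


def chi2(p: Table, n: float) -> float:
    rows, cols = margins(p)
    return n * sum(
        (p[i][j] - rows[i] * cols[j]) ** 2 / (rows[i] * cols[j])
        for i in range(len(rows))
        for j in range(len(cols))
        if rows[i] * cols[j] > 0
    )


def tschuprow(p: Table, n: float) -> float:
    n_rows, n_cols = len(p), len(p[0])
    return math.sqrt(chi2(p, n) / (n * math.sqrt((n_rows - 1) * (n_cols - 1))))


def guttman_lambda(p: Table) -> float:
    _, cols = margins(p)
    error_unknown = 1 - max(cols)                 # X unknown: predict the modal j
    error_known = 1 - sum(max(row) for row in p)  # X known: modal j in each row
    return (error_unknown - error_known) / error_unknown


def tau_b(p: Table) -> float:
    rows, cols = margins(p)
    base = 1 - sum(c * c for c in cols)
    within = sum(v * v / rows[i] for i, row in enumerate(p) for v in row if rows[i] > 0)
    return (within - sum(c * c for c in cols)) / base


def uncertainty(p: Table) -> float:
    def xlogx(v: float) -> float:
        return v * math.log(v) if v > 0 else 0.0

    rows, cols = margins(p)
    numerator = (sum(xlogx(r) for r in rows) + sum(xlogx(c) for c in cols)
                 - sum(xlogx(v) for row in p for v in row))
    return numerator / sum(xlogx(c) for c in cols)


perfect = normalize([[5, 0], [0, 5]])      # functional dependency
independent = normalize([[1, 1], [1, 1]])  # independence
```

On the `perfect` table all four coefficients reach 1, and on the `independent` table they all vanish, which matches the two extreme situations described above.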

3.2 How to Build the Co-occurrence Table 

In our problem, we search for a bi-partition constituted of two partitions such that
their association is maximal. However, whereas the previous measures are based
on a contingency table, i.e. a table crossing two partitions on the same set, the
partitions considered in our problem are built on separate sets which have a
semantic link expressed through the data table. This link allows us to construct
a co-occurrence table $(\eta_{ij})$. We consider $h$ attributes $V_l$ with values in a discrete
space $dom_l$ ($V_l : \mathcal{O} \to dom_l$) and denote by $\mathcal{Q}$ the set of attribute-value pairs,
$\mathcal{Q} = \biguplus_{l=1}^{h} dom_l$. Using the previous notations, we build a co-occurrence table
between a partition $P = (P_1, \ldots, P_K)$ on the set $\mathcal{O}$ of objects and a partition
$Q = (Q_1, \ldots, Q_K)$ on the set $\mathcal{Q}$ such that the elements of this table equal the
number of attribute-value pairs of $Q_j$ taken by the objects of $P_i$. More precisely,

$$\eta_{ij} = \sum_{x \in P_i} \sum_{y \in Q_j} \sum_{l=1}^{h} \delta_{V_l(x), y}$$

where $\delta$ is the Kronecker symbol (1). We also use the following notations:
$\eta_{i.} = \sum_j \eta_{ij}$, $\eta_{.j} = \sum_i \eta_{ij}$ and $\eta_{..} = \sum_i \sum_j \eta_{ij} = \sharp(\mathcal{O}) \times h$.

Then, we compute the previous measures by substituting in the formulas $p_{ij}$,
$p_{i.}$, $p_{.j}$ and $n$ by respectively $\eta_{ij}$, $\eta_{i.}$, $\eta_{.j}$ and $\eta_{..}$. Nevertheless, the computation
of the Uncertainty coefficient requires the normalization of the co-occurrence
table. That is why we divide every $\eta_{ij}$ by $\eta_{..}$ when computing this coefficient.
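The construction of the co-occurrence table can be sketched in a few lines of Python. The data layout is an assumption chosen for illustration: each object is a dictionary mapping an attribute name to its value, an attribute-value pair is an `(attribute, value)` tuple, and the partitions are given as lists of sets (object indices for `P`, attribute-value pairs for `Q`). The function name `cooccurrence` is ours.

```python
from typing import Dict, Hashable, List, Sequence, Set, Tuple

AVPair = Tuple[str, Hashable]  # (attribute name, value)


def cooccurrence(objects: Sequence[Dict[str, Hashable]],
                 P: Sequence[Set[int]],
                 Q: Sequence[Set[AVPair]]) -> List[List[int]]:
    """eta[i][j] = number of attribute-value pairs of Q[j] taken by the
    objects of P[i]."""
    eta = [[0] * len(Q) for _ in P]
    for i, object_cluster in enumerate(P):
        for x in object_cluster:
            pairs_of_x = set(objects[x].items())  # one pair per attribute
            for j, pair_cluster in enumerate(Q):
                eta[i][j] += len(pairs_of_x & pair_cluster)
    return eta


objects = [
    {"color": "red", "shape": "round"},
    {"color": "red", "shape": "square"},
    {"color": "blue", "shape": "square"},
]
P = [{0, 1}, {2}]
Q = [{("color", "red"), ("shape", "round")},
     {("color", "blue"), ("shape", "square")}]
eta = cooccurrence(objects, P, Q)
total = sum(sum(row) for row in eta)  # number of objects times number of attributes
```

In this small example the grand total of the table is 6, i.e. 3 objects times 2 attributes, as expected from the identity $\eta_{..} = \sharp(\mathcal{O}) \times h$.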

3.3 Adaptation of Functions 

Recall that the association measures only partially satisfy the necessary properties.
On a contingency table the notion of purity has no meaning, whereas it is
a key point in our problem when we want to compare tables of different sizes. A
cluster of objects is all the purer as its objects have similar descriptions
on all the attributes. Similarly, a cluster of attribute-value pairs is all the
purer as its attribute-value pairs are taken by the same set of objects.

For example, we consider a data table composed of three pure clusters on
objects and attribute-value pairs, and the associated perfect bi-partition (see
Fig. 1, left). In this case, the association measure is maximum. If we merge two
classes of objects, and the two associated classes of attribute-value pairs (see
Fig. 1, right), the association measure is still maximum. Consequently, those two
different situations are not discriminated by the measure.



(1) $\delta_{V_l(x), y} = 1$ if $V_l(x) = y$, and $\delta_{V_l(x), y} = 0$ otherwise.







Fig. 1. A maximum value for two different situations 



To overcome this drawback, we use a diversity measure between a cluster of
objects and a cluster of attribute-value pairs to map $\eta_{ij}$ into [0, 1] such that
it is maximum when the cluster is pure.

Since each object owns a unique value per attribute, $\eta_{ij}$ is maximum when
each object of $P_i$ owns, for all the attributes represented in $Q_j$, a value belonging
to $Q_j$. In this case, $\eta_{ij} = \sharp(P_i) \times \sum_{a=1}^{h} (1 - \delta_{dom_a \cap Q_j, \emptyset})$, i.e. the number of
objects times the number of different attributes belonging to $Q_j$. Thus in general
$\eta_{ij}$ divided by $\sharp(P_i) \times \sum_{a=1}^{h} (1 - \delta_{dom_a \cap Q_j, \emptyset})$ belongs to [0, 1]. However, this is
not sufficient to discriminate the cases of Fig. 1. We must penalize a cluster $Q_j$ with
several values per attribute to solve the problem. We decide to penalize it by
$\prod_{a \in H_j} \sharp(dom_a \cap Q_j)$, with $H_j = \{a \in [1, h] \mid dom_a \cap Q_j \neq \emptyset\}$. It is equal to
one only when there is one value per attribute in $Q_j$ and is greater than one
otherwise.

Consequently, to map $\eta_{ij}$ into [0, 1], we replace $\eta_{ij}$ by the following diversity
measure in the co-occurrence table:

$$\frac{\eta_{ij}}{\sharp(P_i) \times \sum_{a=1}^{h} (1 - \delta_{dom_a \cap Q_j, \emptyset}) \times \prod_{a \in H_j} \sharp(dom_a \cap Q_j)}$$

Nevertheless, modifying the values in the co-occurrence table does not influence
the value of the association measure used. Indeed, association measures rely
on the evaluation of the similarity between $\eta_{ij}$ and $\eta_{i.}$ and/or $\eta_{.j}$. That is why,
in order to take into account the effect of the diversity measure, we have to
compute a global index of diversity. It consists in embedding the co-occurrence
table in the set of assignment matrices to force a functional link between the
elements of a bi-partition. The set of possible assignment matrices $A = (A_{ij})$
contains all matrices such that

$$\forall i, \exists! j \text{ such that } A_{ij} = \eta_{ij} \neq 0 \quad \text{and} \quad \forall j, \exists! i \text{ such that } A_{ij} = \eta_{ij} \neq 0$$

Among the set of assignment matrices, we choose the one whose average
coefficient is maximum. The global index of diversity is this average. Notice that the
association measures belong to [0, 1] and that the average of $(A_{ij})$ also belongs
to [0, 1]. We thus weight each association measure by multiplying it by this global
diversity index.
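The diversity measure and the global diversity index can be illustrated on the two situations of Fig. 1. The Python sketch below rests on several assumptions that are ours: a cluster of attribute-value pairs is a set of `(attribute, value)` tuples, the table is square, the best assignment matrix is found by brute force over permutations (reasonable only for a handful of clusters), and the average is taken over the K assigned cells.

```python
from itertools import permutations
from typing import Hashable, List, Sequence, Set, Tuple

AVPair = Tuple[str, Hashable]  # (attribute name, value)


def diversity(eta_ij: float, n_objects: int, q_cluster: Set[AVPair]) -> float:
    """eta_ij divided by (#objects x #attributes in Q_j x penalty), where the
    penalty is the product of the numbers of values per attribute in Q_j."""
    values_per_attribute = {}
    for attribute, _ in q_cluster:
        values_per_attribute[attribute] = values_per_attribute.get(attribute, 0) + 1
    penalty = 1
    for count in values_per_attribute.values():
        penalty *= count
    denominator = n_objects * len(values_per_attribute) * penalty
    return eta_ij / denominator if denominator else 0.0


def global_diversity(d: Sequence[Sequence[float]]) -> float:
    """Largest average over the assignments that pick exactly one non-zero
    cell per row and per column (0.0 if no such assignment exists)."""
    k = len(d)
    best = 0.0
    for perm in permutations(range(k)):
        cells = [d[i][perm[i]] for i in range(k)]
        if all(c != 0 for c in cells):
            best = max(best, sum(cells) / k)
    return best


# Three pure clusters: 2 objects each, one value per attribute (A and B).
Q = [{("A", v), ("B", v)} for v in (1, 2, 3)]
eta = [[4, 0, 0], [0, 4, 0], [0, 0, 4]]
pure = [[diversity(eta[i][j], 2, Q[j]) for j in range(3)] for i in range(3)]

# The first two clusters are merged on both sides.
Q_merged = [Q[0] | Q[1], Q[2]]
sizes = [4, 2]
eta_merged = [[8, 0], [0, 4]]
merged = [[diversity(eta_merged[i][j], sizes[i], Q_merged[j]) for j in range(2)]
          for i in range(2)]
```

Here `global_diversity(pure)` equals 1.0 while `global_diversity(merged)` equals 0.625, so the index separates the two situations that the association measures alone rate equally.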
4 An Experimental Study of the Functions

In this section, we empirically study the functions regarding their capability
to discriminate associated partitions. Those functions are the $\tau_b$, the Tschuprow
and the Uncertainty coefficients. Whereas in the previous section we modified
those functions to ensure their discrimination of the pure bi-partition, we study
in this section the regularity of those functions over other bi-partitions.

4.1 Study of the Quality Indices of the Partitions 

In [RF00] we have proposed a graph theoretical approach to define an order on
the set of partitions. It consists in constructing a data table such that there
exists a bi-partition whose clusters are all pure. This bi-partition is then used as
a reference called the ideal bi-partition. A distance between each partition of $\mathcal{P}_O$
and the partition on O belonging to the ideal bi-partition is then computed
to order $\mathcal{P}_O$. In the same manner, a distance between each partition of $\mathcal{P}_Q$ and
the partition on Q belonging to the ideal bi-partition is also computed. The
distances used ITOi! are well discriminant and evaluate the proximity between 
two partitions in terms of similarity on the variables and objects shared between 
the closest clusters of both partitions. 

That is why we study the discrimination power of a function on a set of 
partitions through the link between function’s values and those orders on the set 
of partitions. 

The following graphs represent the variations of the measures with respect to the
distance of the partitions to the associated ideal one. The partitions on the set
of objects O are ordered on the abscissa axis (right). The partitions on the set of
attribute-value pairs Q are ordered on the ordinate axis (left). The z-axis gives
the value of the function computed with the two partitions. The partitions are
based on a 12 x 12 data table composed of three pure clusters on each set. In
a first step, we generated all the partitions on each set. But since
the number of partitions on each set is huge (more than 4 million), we could
not plot the graphs over the whole set of pairs of partitions. That is why we
selected a subset of 100 partitions in each set. We did not choose those partitions
randomly, for two reasons. First, if we pick partitions uniformly,
we obtain almost only partitions with the worst values regarding the function and the
distance. Indeed, among the 4 million partitions there are many bad ones.
Secondly, we do not know the distribution of good partitions. Consequently, we
chose partitions among the exhaustive set such that they are well spread
over the distance.
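The order of magnitude quoted above can be checked: the number of partitions of a set of 12 elements is the Bell number B(12). The following Python sketch computes it with the Bell triangle; the function name `bell` is ours.

```python
def bell(n: int) -> int:
    """Number of partitions of an n-element set, computed with the Bell
    triangle: each row starts with the last entry of the previous row."""
    row = [1]
    for _ in range(n):
        new_row = [row[-1]]
        for value in row:
            new_row.append(new_row[-1] + value)
        row = new_row
    return row[0]


partitions_per_set = bell(12)  # 4213597
```

With 4,213,597 partitions on each side, the full set of pairs of partitions contains about 1.8 x 10^13 elements, which explains why only 100 partitions per set are plotted.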

The Tschuprow measure (Fig. 2, left) seems smooth and regular over the
pairs of partitions. The incline of the surface is correctly oriented, with high
values for good partitions and a slow decrease towards the bad ones. Nevertheless,
the slope is not very steep. This could be an obstacle for the optimization
procedure. In fact, in case of an optimization based on local (with respect to the
distance) descent, the bumpiness of the function might be an obstacle. However,
when considering stochastic procedures, like genetic algorithms, the slope of the
surface induces the survival of better individuals (with respect to the distance).
There is a step at the border of the graph, which means that a small increase in
the quality of very bad partitions leads to a large increase of the objective
measure. The modification of the measure (Fig. 2, right) globally increases the
slope of the surface. Although the surface is a little bumpier, stochastic
optimization should be easier on this function.

Fig. 2. Tschuprow and Tschuprow adapted




Fig. 3. $\tau_b$ (left) and $\tau_b$ adapted (right)

In Fig. 3 (left) we can see that the $\tau_b$ measure discriminates the pairs of
partitions fairly well. The highest values of the function are obtained for
good bi-partitions, and the values of the function decrease with the distance.
The rough patches of the surface are more pronounced than for the Tschuprow
measure but the slope is much steeper. The diversity coefficient (Fig. 3,
right) flattens the surface, partially erasing the bumps.

The results obtained with the Uncertainty measure (Fig. 4, left and right)
are very similar to those given by the $\tau_b$ function. It is visually perceptible that
those functions are similar to each other and different from the Tschuprow measure.
This is due to the fact that they do not evaluate the same property on the
co-occurrence table. Whereas the Tschuprow measure evaluates the distance between the
current table and the one corresponding to the independence situation, the two
others evaluate the strength of the functional link between the two partitions.

Fig. 4. Uncertainty coefficient original (left) and adapted (right)

4.2 Smoothness of the Functions 

In this section, we study the behavior of the functions when the co-occurrence 
matrix is slightly modified. Those modifications are built by interpolation be- 
tween two matrices. The following graphs represent the value of the functions at 
each of the 150 steps of the interpolation. 





Fig. 5. Linear interpolation to Gaussian matrix non adapted vs adapted 



In Fig. 5 we interpolate from the matrix corresponding to the ideal bi-partition to
a Gaussian modification of it. The interpolation is linear. We do this to measure
the resistance of the function to a regular destruction of the functional link.

The three functions decrease quasi-linearly (Fig. 5, left). This graph confirms
that the $\tau_b$ and Uncertainty coefficients have a similar behavior, which is rather
different from that of the Tschuprow measure, whose decrease is slower than that of
the two other functions. The diversity coefficient (Fig. 5, right) modifies the
curves so that they have a similar linear slope.




Fig. 6. Random interpolation to random matrix non adapted vs adapted 



In Fig. 6 we interpolate from the ideal co-occurrence matrix to a totally random
one. The interpolation is linear for all cells of the matrix, but each cell has a
randomly chosen number of steps. Consequently, each cell of the matrix has its
own speed of interpolation. This case is far less regular than the previous one.
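The two interpolation schemes can be sketched as follows in Python. The names, the choice of yielding `steps + 1` matrices (both end points included) and the way a per-cell number of steps is drawn (uniformly between 1 and `steps`) are assumptions made for illustration; the text only states that the interpolation is linear and that, in the second case, each cell has its own randomly chosen number of steps.

```python
import random
from typing import Iterator, List

Matrix = List[List[float]]


def linear_path(start: Matrix, end: Matrix, steps: int) -> Iterator[Matrix]:
    """All cells move from start to end at the same linear pace."""
    for t in range(steps + 1):
        w = t / steps
        yield [[(1 - w) * s + w * e for s, e in zip(row_s, row_e)]
               for row_s, row_e in zip(start, end)]


def staggered_path(start: Matrix, end: Matrix, steps: int,
                   rng: random.Random) -> Iterator[Matrix]:
    """Each cell moves linearly but reaches its target after its own randomly
    chosen number of steps, then stays there."""
    durations = [[rng.randint(1, steps) for _ in row] for row in start]
    for t in range(steps + 1):
        yield [[s + (e - s) * min(1.0, t / d)
                for s, e, d in zip(row_s, row_e, row_d)]
               for row_s, row_e, row_d in zip(start, end, durations)]


ideal = [[4.0, 0.0], [0.0, 4.0]]
target = [[1.0, 3.0], [2.0, 2.0]]
regular = list(linear_path(ideal, target, 150))
irregular = list(staggered_path(ideal, target, 150, random.Random(0)))
```

Applying any of the objective functions to each matrix of `regular` or `irregular` gives one curve of the kind plotted in the interpolation figures.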

Along the slight modifications of the matrix, the functions decrease more
quickly than in the previous interpolation (Fig. 6, left). Nevertheless, the
functions are still very smooth. The behavior of the functions is not affected by the
diversity coefficient (Fig. 6, right).

Finally, the proposed modification of the association measures leads to an
increase in their discrimination power while keeping their good behavior in terms
of resistance and regularity with respect to the measure of a functional link.

5 Conclusion and Perspective 

In this article, we have presented an algorithm for finding a bi-partition in
unsupervised clustering. It is based on the search for a pair of the most associated
partitions. Those partitions are based on the set of objects and the set of
attribute-value pairs, which are linked in the original dataset. In order to find this
bi-partition, we propose three objective functions to optimize. We have adapted
the Tschuprow, the $\tau_b$ and the Uncertainty measures to the unsupervised
clustering problem. The experiments we provide yield two main findings.
First, we notice that the $\tau_b$ measure and the Uncertainty coefficient have similar
behaviors. The Tschuprow function is quite different: smoother but possibly
less discriminating. Secondly, the application of the diversity coefficient we have
introduced, to allow the functions to satisfy all the required properties, slightly
modifies the functions. Globally, the functions are smoother and discriminating. In
further work we will present optimization procedures.
Abstract. Changes in the normal rhythmicity of a human heart may result in
different cardiac arrhythmias, which may be immediately fatal or cause irreparable
damage to the heart when sustained over long periods of time. The ability
to automatically identify arrhythmias from ECG recordings is important for
clinical diagnosis and treatment, as well as for understanding the electrophysiological
mechanisms of the arrhythmias. This paper proposes a novel approach
to efficiently and accurately identify normal sinus rhythm and various ventricular
arrhythmias through a combination of phase space reconstruction and machine
learning techniques. Data was recorded from patients experiencing spontaneous
arrhythmia, as well as induced arrhythmia. The phase space attractors
of the different rhythms were learned from both inter- and intra-patient
arrhythmic episodes. Out-of-sample ECG rhythm recordings were classified using
the learned attractor probability distributions with an overall accuracy of 83.0%.



1 Introduction 

Thousands of deaths occur daily due to ventricular fibrillation (VF)[1]. Ventricular 
fibrillation is a disorganized, irregular heart rhythm that renders the heart incapable of 
pumping blood. It is fatal within minutes unless externally terminated by the passage 
of a large electrical current through the heart muscle. Automatic defibrillators, both 
internal and external to the body, have proven to be the only therapy for thousands of 
individuals who experience ventricular arrhythmia. There is evidence [2] to suggest 
that the sooner electronic therapy is delivered following the onset of VF, the greater 
the success of terminating the arrhythmia, and thus, the greater the chance of survival. 
Defibrillators are required to classify a cardiac rhythm as life threatening before the 
device can deliver shock therapy; the patient is usually unconscious. Because of the 
hemodynamic consequences (i.e., the heart ceases to contract, thus no blood flows 
through the body) that accompany the onset of lethal VF, a preventive approach for 
treating ventricular arrhythmia is preferable, such as low-energy shock, pacing regi- 
mens and/or drug administration to prevent the fatal arrhythmia from occurring in the 
first place. Furthermore, there is evidence [3] to suggest that high-energy shocks de- 
livered during lethal arrhythmia may be harmful to the myocardium. Thus, the ability 
to quickly identify and/or predict the impending onset of VF is highly desirable and 
may increase the alternate therapies available to treat an individual prone to VF. 
Many of the current algorithms differentiate ventricular arrhythmias using classical 
signal processing techniques, e.g., threshold crossing intervals, autocorrelation, 
VF-filter, spectral analysis [4], time-frequency distributions [5], coherence analysis [6], 
and heart rate variability [7, 8]. In order to improve frequency resolution and mini- 
mize spectral leakage, these algorithms need five or more seconds of data when clas- 
sifying the rhythms. This paper proposes that phase space embedding [9] combined 
with data mining techniques [10] can learn and accurately characterize chaotic attrac- 
tors for the different ventricular tachyarrhythmias in short data intervals. Others who 
have used phase space techniques to study physiological changes in the heart include 
Bettermann and VanLeeuwen [11], who demonstrated that the changes in heart beat 
complexity between sleeping and waking states were not a simple function of the 
heart beat intervals, rather the changes in heart beat complexity were related to the 
existence of dynamic phases in heart period dynamics. 

In this study, signals from two leads of a normal twelve lead ECG recording [12, 13] 
are transformed into a reconstructed state space, also called phase space. Attractors 
are learned for each of the following rhythms: sinus rhythm (SR), monomorphic ven- 
tricular tachycardia (MVT), polymorphic ventricular tachycardia (PVT), and ventricu- 
lar fibrillation. A neural net is used to learn the attractors using features formed from 
the two-dimensional reconstructed phase space. Attractors are learned and tested from 
inter- and intra-patient data. 

1.1 ECG Recording Overview 

An ECG recording is a measure of the electrical activity of the heart from electrodes 
placed at specific locations on the torso. A synthesized surface recording of one 
heartbeat during SR can be seen in Fig. 1. The cardiac cycle can be divided into 
several features. The main features are the P wave, PR interval, QRS complex, Q 
wave, ST segment, and T wave. Each of these components represents the electrical 
activity in the heart during a portion of the heartbeat [14]. 

• The P wave represents the depolarization of the atria. 

• The PR interval represents the time of conduction from the onset of atrial ac- 
tivation to the onset of ventricular activation through the bundle of His. 

• The QRS complex is a naming convention for the portion of the waveform 
representing the ventricular activation and completion. 

• The ST segment serves as the isoelectric line from which amplitudes of other 
waveforms are measured, and also is important in identifying pathologies, 
such as myocardial infarctions (elevations) and ischemia (depressions). 

• The T wave represents ventricular repolarization. 

Recordings seen at different lead locations on the body may exhibit different morpho- 
logical characteristics. Differences in the ECG recordings from one lead to another 
are a result of the electrodes being placed at different positions with respect to the 
heart. Thus the projection of the electrical potential at a point near the sinoatrial node 
would differ from that seen by an electrode near the atrioventricular node. Differences 
in recordings from one person to another may be due to differences in the size of 
the heart, the orientation of the heart in the body, exact lead location, and the 
health of the heart itself. 




Fig. 1. Synthesized ECG recording for one heartbeat 



2 Methods 

2.1 Recordings 

Simultaneous recordings of surface leads II and V1 of a normal 12-lead ECG [12, 13] 
were obtained from six patients using an electrophysiological recorder. These patients 
exhibited sustained monomorphic ventricular tachycardia, polymorphic ventricular 
tachycardia, ventricular fibrillation, or a combination of these rhythms during 
electrophysiological (EP) testing and/or automatic implantable 
cardioverter/defibrillator (AICD) implantation. None of the data was from healthy patients. 

Two independent observers classified the ECG recordings as one of the following 
rhythms: VF, PVT, MVT, and SR. The criteria for classifying the different rhythms 
were [15-17]: 

• VF was defined by undulations that were irregular in timing and morphology 
without discrete QRS complexes, ST segments, or T waves with cycle length 
< 200 msec. 

• PVT was defined as ventricular tachycardia having variable QRS morphol- 
ogy but with discrete QRS complexes with cycle length < 400 msec. 

• MVT was defined as ventricular tachycardia having a constant QRS mor- 
phology with cycle length < 600 msec. 

• SR was defined by rhythms exhibiting P waves, QRS complexes, ST seg- 
ments, and T waves with no aberrant morphology interspersed in the data in- 
terval. 

Ventricular tachycardia (VT) is most commonly seen in patients with coronary artery 
disease and prior myocardial infarctions. Patients with dilated cardiomyopathies, 
arrhythmogenic right ventricular dysplasia, congenital heart disease, hypertrophic 
cardiomyopathy, or mitral valve prolapse may also experience VT. Infrequently, VT occurs in 
patients without identifiable heart abnormalities [18]. Ventricular fibrillation occurs 
primarily in patients with transient or permanent conduction block. Patients 
experience VF under a variety of conditions, including: 1) electrically induced by a 
low-intensity stimulus delivered while the ventricles are repolarizing; 2) electrically 
induced by a burst (approximately 1 second duration) of 60 Hz AC current; 3) spontaneously 
induced due to ischemia leading to a conduction block; 4) reperfusion-induced; 
and 5) electrically induced by high-intensity electric shocks [16]. 

Examples of the different rhythm morphologies can be seen in Fig. 2. 



[Plot for Fig. 2: "Different Rhythms"; horizontal axis in sec] 



Fig. 2. Recording for individual examples of rhythm morphologies: monomorphic ventricular 
tachycardia (MVT), polymorphic ventricular tachycardia (PVT), ventricular fibrillation (VF), 
and sinus rhythm (SR) 



2.2 Preprocessing 

The data were passed through an anti-aliasing filter with a cutoff frequency of 200 Hz 
and subsequently digitized at 1,200 Hz. Up to 60 seconds of continuous data were 
digitized for each rhythm. In this study, the data were divided into contiguous 
2.5-second intervals of MVT, PVT, VF, or SR rhythms. The data were zero-meaned prior 
to further analysis. 
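
The segmentation step can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `split_and_zero_mean` is hypothetical, the mean is removed separately from each interval, and a trailing remainder shorter than 2.5 seconds is dropped. The text does not specify either of the last two choices.

```python
from typing import List

SAMPLE_RATE_HZ = 1200      # digitization rate stated in the text
INTERVAL_SECONDS = 2.5     # interval length stated in the text


def split_and_zero_mean(signal: List[float],
                        sample_rate_hz: int = SAMPLE_RATE_HZ,
                        interval_seconds: float = INTERVAL_SECONDS) -> List[List[float]]:
    """Cut a digitized recording into contiguous intervals and zero-mean each one."""
    samples_per_interval = int(round(sample_rate_hz * interval_seconds))  # 3000 samples
    intervals = []
    # Step through the recording in non-overlapping windows.
    # Assumption: a short remainder at the end is discarded.
    for start in range(0, len(signal) - samples_per_interval + 1, samples_per_interval):
        window = signal[start:start + samples_per_interval]
        mean = sum(window) / len(window)
        # Assumption: the mean is removed per interval, not per recording.
        intervals.append([value - mean for value in window])
    return intervals
```

At 1,200 Hz a 2.5-second interval holds 3,000 samples, so a 60-second recording would yield at most 24 intervals per lead.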



2.3 Feature Identification 

A two-dimensional phase space was constructed using the lead II and V1 ECG recordings. 
Figure 3 illustrates the generated phase space. 

Each rhythm is attracted to a different subset of the phase space. This subset of the 
phase space is the attractor for that particular rhythm. Visually, one can differentiate 
the rhythm attractors in Fig. 3. However, for an automatic defibrillator to automati- 
cally classify rhythms, features must be determined that define each attractor. These 
features were generated using the following method. 
Pseudo Code of Feature Identification 

Combine all lead II training intervals 
    Take histogram of combined signals 
    Determine boundary values that separate the combined data 
        into 10 equally filled bins (each bin contains ~10% of the data) 
Repeat for lead V1 
Using the boundaries for each lead, create 100 regions in the phase space 
For each individual training interval 
    Determine percentage of data points in each region 
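
The pseudo code above might be realized roughly as follows. This is a sketch, not the authors' implementation: the function names are hypothetical, the equal-fill boundaries are taken directly from the sorted samples (an assumption about how the histogram step is carried out), and the two leads are simply called `lead_x` and `lead_y`.

```python
from bisect import bisect_right
from typing import List, Sequence

N_BINS = 10  # bins per lead, giving 10 x 10 = 100 phase-space regions


def equal_fill_boundaries(values: Sequence[float], n_bins: int = N_BINS) -> List[float]:
    """Return n_bins - 1 interior boundaries so each bin holds about the same share of data."""
    ordered = sorted(values)
    return [ordered[(len(ordered) * k) // n_bins] for k in range(1, n_bins)]


def region_percentages(lead_x: Sequence[float], lead_y: Sequence[float],
                       bounds_x: List[float], bounds_y: List[float]) -> List[float]:
    """Percentage of one interval's (lead_x, lead_y) points falling in each region."""
    n_bins_y = len(bounds_y) + 1
    counts = [0] * ((len(bounds_x) + 1) * n_bins_y)
    for a, b in zip(lead_x, lead_y):
        row = bisect_right(bounds_x, a)   # bin index along the first lead
        col = bisect_right(bounds_y, b)   # bin index along the second lead
        counts[row * n_bins_y + col] += 1
    total = len(lead_x)                   # assumes both leads have equal length
    return [100.0 * c / total for c in counts]
```

In use, the boundaries would be computed once from all combined training intervals of each lead, and `region_percentages` would then be called per interval to produce the 100 features.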



[Plot for Fig. 3: "Reconstructed Phase Space for Different Rhythms"; four panels: MVT, PVT, VF, SR] 



Fig. 3. Generated two-dimensional phase space for examples of MVT, PVT, VF, and SR. No- 
tice that the different rhythms fill a different subset of the phase space 



An example of the regions subdividing the phase space for an SR rhythm can be seen 
in Fig. 4. 



2.4 Attractor Learning 

The attractors were learned using neural networks with 100 inputs, one output, and 
two hidden layers. The first and second hidden layers consisted of 10 and 3 neurons, 
respectively, with tan-sigmoid transfer functions. The output layer was a single 
log-sigmoid neuron. The networks were trained using the Levenberg-Marquardt algorithm in 
MATLAB. The inputs to the neural networks were the percentages of data points in 
each feature bin described in Sect. 2.3. Leave-one-out cross-validation [19] was 
used in the training and testing of the neural networks. Given an indexed data set 
$\{d_i : i = 1, \ldots, n\}$ containing $n$ elements, $n$ training/testing runs are performed. For the 
$j$-th run, the test set is $\{d_j\}$ and the training set is $\{d_i : i \neq j\}$. 
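
The leave-one-out procedure can be written generically. The sketch below is independent of the neural network itself: `train` and `is_correct` are hypothetical callbacks supplied by the caller, and nothing here reproduces the MATLAB training used in the study.

```python
from typing import Callable, List, Sequence, TypeVar

Sample = TypeVar("Sample")
Model = TypeVar("Model")


def leave_one_out_accuracy(data: Sequence[Sample],
                           train: Callable[[List[Sample]], Model],
                           is_correct: Callable[[Model, Sample], bool]) -> float:
    """Run n train/test rounds; round j tests on data[j] and trains on the rest."""
    hits = 0
    for j in range(len(data)):
        training_set = [d for i, d in enumerate(data) if i != j]
        model = train(training_set)       # fit on the n - 1 remaining elements
        if is_correct(model, data[j]):    # score the single held-out element
            hits += 1
    return hits / len(data)
```

Each element is used exactly once for testing, so the returned value is the fraction of held-out elements classified correctly.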

Individual neural networks were used to classify each rhythm. The output of each 
neural net was rounded, so that 1 classified the input data as the specific rhythm and 0 
classified it as some other rhythm. For a patient exhibiting two different morphologies, 
two neural networks would be trained and tested to classify the ECG intervals. 
An example of the classifier architecture for Patient 2 can be seen in Fig. 5. For a 
classification to be legitimate, exactly one neural network may claim the interval. 
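
The decision logic that combines the per-rhythm networks can be illustrated as below. The function name `arbitrate` and the dictionary interface are hypothetical, and rounding at 0.5 is an assumption about how the log-sigmoid output was rounded.

```python
from typing import Dict, Optional


def arbitrate(outputs: Dict[str, float]) -> Optional[str]:
    """Round each rhythm network's output to 0 or 1; accept only a single positive vote."""
    # A log-sigmoid output lies in (0, 1); 0.5 is used as the rounding cut (assumption).
    positives = [rhythm for rhythm, value in outputs.items() if value >= 0.5]
    if len(positives) == 1:
        return positives[0]
    # Undetermined: either no network or more than one network claimed the interval.
    return None
```

For a patient with PVT and VF networks, `arbitrate({"PVT": 0.9, "VF": 0.1})` returns `"PVT"`, while two positive outputs or none return `None`.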

[Plot for Fig. 4: "Feature Boundaries for a Sinus Rhythm"; horizontal axis: Lead II] 

Fig. 4. Example of feature bin boundaries for a 2.5 second recording of sinus rhythm 

2.5 Comparative Analysis 

We compare our new method against three others. The first comparison is to a method 
based on the Lempel-Ziv complexity measure. The second comparison is to a method 
based on heart rate. The third comparison is to two independent human expert observers. 




Fig. 5. Classifier architecture. The number of rhythm neural nets corresponds to the number of 
rhythms for a particular set of data. For sets of data with more than two rhythms to classify, the 
XOR box is more complicated than a single exclusive OR 
Zhang et al. [20] proposed a method for detecting MVT, VF, and SR using the 
Lempel-Ziv complexity measure. The complexity measure is a function of the number of 
patterns found in a string of threshold crossings. For each interval of data, a new 
threshold was calculated. As with the method proposed in this paper, Zhang’s 
complexity measure does not need to detect the occurrences of beats. They used various 
interval lengths to determine the minimum amount of data needed to attain 100% 
training accuracy; no test accuracy was determined. A seven-second interval was 
found to be the minimum length needed to correctly discriminate the three rhythms. 
For the rhythms (MVT, VF, and SR), intervals of length two and three seconds 
achieved training accuracies of (93.14%, 95.10%, and 98.04%) and (93.14%, 97.55%, 
and 95.59%), respectively. Zhang classified the rhythms using the following cutoff 
values for the complexity measures: 

• SR - for complexity measures less than 0.150 

• MVT - for complexity measures between 0.150 and 0.486 

• VF - for complexity measures greater than 0.486. 
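
The cutoff rule above amounts to a three-way threshold test. The sketch below assumes the complexity value has already been computed; the function name is hypothetical, and assigning the exact boundary values 0.150 and 0.486 to MVT is an assumption, since the list does not say which class they belong to.

```python
def classify_by_complexity(complexity: float) -> str:
    """Map a Lempel-Ziv complexity value to a rhythm label using the quoted cutoffs."""
    if complexity < 0.150:
        return "SR"
    if complexity <= 0.486:   # boundary handling is an assumption
        return "MVT"
    return "VF"
```

Note that the rule has no PVT class, so PVT intervals can never be labeled correctly by it.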

Heart rate is used in many AICDs to discriminate one rhythm from another. 
Medtronic, Inc., a commercial maker of AICDs, uses rate detection zones and different 
counts to detect and classify tachyarrhythmias [17]. AICDs count the number of beats 
in each detection zone; if a specified number of beats fall within a particular zone 
without an SR beat being detected, the interval is marked as a tachyarrhythmia. 
Since the data intervals used are only two and a half seconds long, there are not enough 
beats to be counted, so only the heart rate is used to classify the rhythm intervals. For 
each individual interval, the threshold for marking a new beat was set to 60% of the 
maximum amplitude of that interval. 
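
A simple way to turn the 60% amplitude threshold into a heart rate is to count upward threshold crossings. The sketch below is an assumption about the beat-marking step, not the detection logic of any commercial device: it has no refractory period, the function name is hypothetical, and the rate zones used for the final classification are not given in the text, so only the rate is computed.

```python
from typing import Sequence


def estimate_heart_rate_bpm(interval: Sequence[float],
                            sample_rate_hz: float = 1200.0,
                            threshold_fraction: float = 0.60) -> float:
    """Count upward threshold crossings in one interval and convert to beats per minute."""
    threshold = threshold_fraction * max(interval)
    beats = 0
    above = False
    for value in interval:
        if value >= threshold and not above:
            beats += 1                    # a new beat starts at each upward crossing
        above = value >= threshold
    duration_seconds = len(interval) / sample_rate_hz
    return 60.0 * beats / duration_seconds
```

With only 2.5 seconds of data, each counted beat changes the estimate by 24 beats per minute, which illustrates how coarse a rate-only decision is at this interval length.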



3 Results 

3.1 Data 

Six patients comprised the study population. The heart rhythms exhibited by the six 
patients can be seen in Table 1. Four of the patients exhibited different combinations 
of two or three types of rhythms. The last two patients exhibited all four types of 
rhythms. Two independent observers performed the original rhythm classification. 

Table 1. Number of 2.5 s rhythm intervals recorded for each patient 

Patient   MVT   PVT    VF    SR 
1           -     -    23    27 
2           -     6    12     - 
3           -    23     -    30 
4          15     8     4     - 
5          15     8     2    33 
6          20     6     5    34 
Overall inter-observer agreement for rhythm classification was 80.7%. The two 
observers conferred to reach consensus on the classification of the remaining 19.3%. 
The intervals used in this study were not meticulously selected to have comparable 
amplitudes, waveforms, and heart rates. The intervals were selected blindly from 
rhythms classified by the two observers. 

3.2 Intra-patient Classification 

For each patient, a classifier was created for each rhythm type. The neural nets in 
the classifiers learned the training data with 100% accuracy within approximately 
20 epochs under leave-one-out cross-validation. On the held-out intervals, the 
classifiers correctly identified the rhythm type from 69.8% to 83.3% of the time, with an 
overall average accuracy of 77.1%. The accuracy for each patient’s classifier is listed 
in Table 2. Each classifier had four possible outputs: 

• Correctly Classified - 2.5-second rhythm interval was classified correctly. 

• Incorrectly Classified - 2.5-second rhythm interval was classified as a differ- 
ent rhythm. 

• Undetermined (no classification) - 2.5-second rhythm interval was not clas- 
sified. 

• Undetermined (two classifications) - 2.5-second rhythm interval was classi- 
fied as two rhythms (It should be noted that no rhythm interval was classified 
as more than two rhythms.) 



Table 2. Intra-patient classifier accuracy 

           Correctly    Incorrectly   Undetermined          Undetermined          Percent 
Patient    Classified   Classified    (No classification)   (2 classifications)   Accuracy 
1              41            1                 2                     6              82.0% 
2              15            0                 2                     1              83.3% 
3              37            3                 8                     5              69.8% 
4              21            2                 1                     3              75.0% 
5              44            5                 3                     6              75.8% 
6              51            1                 5                     8              78.5% 




3.3 Inter-patient Classification 

All 271 data segments from the six patients were combined and classified. The train- 
ing data was learned with 100% accuracy within approximately 30 epochs. Leave- 
one-out cross-validation was performed. The accuracy of classifying the 271 rhythm 
intervals was improved compared to the intra-patient classification accuracy. The 
classification accuracy for the 271 intervals was 83.0%, with the following break- 
down of classification: 
• 225 were correctly classified. 

• 12 were incorrectly classified. 

• 11 were undetermined due to no classification. 

• 23 were undetermined due to two or more classifications (only one interval 
was classified as three separate rhythms). 
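
As a quick check of the arithmetic, the four outcome counts listed above can be summed and the reported accuracy recomputed. The dictionary keys are illustrative names only.

```python
counts = {
    "correct": 225,
    "incorrect": 12,
    "undetermined_none": 11,
    "undetermined_multiple": 23,
}
total = sum(counts.values())
accuracy_percent = 100.0 * counts["correct"] / total
print(total, round(accuracy_percent, 1))  # prints: 271 83.0
```

The counts add up to the 271 intervals, and 225 of 271 gives the stated 83.0%.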

The confusion matrix for the proposed method is given in Table 3. Recall that, because of 
the structure of the proposed classifier, a data interval may be under-classified (no 
classification) or over-classified (two or more classifications); hence the total number of 
classifications in Table 3 is not 271. 



Table 3. Confusion matrix for phase space classification method 

                   Classified As              Valid 
Rhythm     SR    MVT    PVT     VF    Classification    Accuracy 
SR        117      1      7      6          109           87.9% 
MVT         1     47      5      0           42           84.0% 
PVT         3      4     45      2           39           76.5% 
VF          2      0      6     38           35           76.1% 




3.4 Complexity Measure Inter-patient Classification 

Using the complexity measure algorithm from [20], the complexity measure for each 
interval was calculated. The distributions of the measures for the different rhythms 
are shown in Fig. 6. The graph shows that, unlike in Zhang’s training results, 
there is no distinct separation between the complexity measures of the different rhythms; 
nor were the values obtained from this data within the same ranges as those 
determined by Zhang. The results are extremely poor, as seen in the accuracies given in 
Table 4. 



Table 4. Confusion matrix for complexity measure classification 

                   Classified As 
Rhythm     SR    MVT    PVT     VF    Accuracy 
SR        116      8      0      0      93.5% 
MVT        50      0      0      0 
PVT        51      0      0      0 
VF         38      8      0      0 

Fig. 6. Complexity measure distribution for all four rhythm types 

3.5 Heart Rate Inter-patient Classification 

Classification using the heart rate had an overall accuracy of 62%. Misclassifications 
occurred in all rhythm intervals. The MVT intervals had the worst accuracy. The 
classification using heart rate can be seen in Table 5. 



Table 5. Confusion matrix for heart rate classification 

                   Classified As 
Rhythm     SR    MVT    PVT     VF    Accuracy 
SR         83     38      3      0      66.9% 
MVT         0     11     20     19      22.0% 
PVT         0      0     40     11      78.4% 
VF          0      1      1     44      95.6% 




4 Discussion 

Ideally, an implantable antitachycardia device should be capable of several modes of 
therapy including antitachycardia pacing, low-energy cardioversion, and high-energy 
defibrillation [21-23]. Patients requiring these types of therapy often experience more 
than one rhythm type. These different arrhythmias may require different therapies. 
However, for the several modes of therapy to be available in one device, detection 
algorithms must be able to accurately differentiate among various arrhythmias. The 
results from this preliminary study are encouraging for developing accurate detection 
algorithms among the various ventricular tachyarrhythmias. The ability to accurately 
classify rhythms experienced by individual patients more than 75% of the time is in 
close agreement with the classification of trained observers. The classification accu- 
racy across all patients was better for the automated scheme than for the original clas- 
sification by trained observers. The classification performed using the complexity 
measures of the rhythms was extremely poor. It is obvious that Zhang’s threshold 
values are not generalizable. Even if new threshold values were determined for our 
data set, their classification method would perform poorly, as can be seen in Fig. 6 by 
the strong overlapping of the classes. 

Using the reconstructed phase space to classify out-of-sample ECG recordings per- 
formed better than the classification using the heart rate alone. This is due to several 
reasons. First and foremost, one of the new algorithm’s advantages is its ability 
to classify ECG rhythms in only 2.5 seconds. Most ICDs require 10 seconds to 
classify a tachyarrhythmia. Many of the commercial detection algorithms also allow 
the medical provider to determine templates for the patient’s SR. As these were out- 
of-sample intervals no templates could be generated. Thus the detection of heartbeats 
ranged drastically from one interval to the next. Secondly, as stated previously, the 
morphology seen in an ECG recording is a function of the healthiness of the heart. 
Because each of these rhythms was recorded during electrophysiological (EP) testing 
and/or automatic implantable cardioverter/defibrillator (AICD) implantation, none of 
these hearts can be considered extremely healthy. Thus individual rhythms vary 
greatly from one patient to the next. For example, during SR, one patient had T-waves 
whose amplitudes were as large as the QRS. The T-waves were counted as new 
heartbeats, thus doubling the calculated heart rate. Finally, even though the data were 
zero-meaned, linear trends were not removed from the intervals, thus fewer beats were 
counted. 

Although the proposed method was accurate 83% of the time, if used in AICDs in its 
current form, the misclassification of SR and MVT as VF could cause a patient to 
receive an unnecessary defibrillation shock, which could be detrimental 
to the patient. Some of these false classifications were due to SR intervals in 
which the maximum amplitude of the signal was not very large, so that the phase space 
reconstruction of these non-fatal rhythms was very close to that of VF. Further 
improvement is still needed before these short intervals can be used in commercial 
applications, such as the development of multi-therapy implantable antitachycardia 
devices. The high classification accuracy of the proposed method within a short period 
of time reinforces the authors’ conjecture that phase space is a valid starting point in 
the classification of ventricular tachyarrhythmias. Other features will need to be 
added to the proposed method to improve the classification accuracy for short 
intervals of data. Further investigations for defining the rhythm attractors will incorporate 
time-delay and multi-dimensional phase spaces. 

Future research into the identification of ventricular tachyarrhythmias may unveil 
electrophysiological mechanisms responsible for the onset and termination of fibrilla- 
tory rhythms. We hypothesize that the patterns of the quasi-periodic [24] attractors of 
heart rhythms change immediately prior to (within a 10-minute time period) the onset 
of a serious ventricular arrhythmia. Using these attractors, future research will focus 
on the transitions from one phase space attractor to another. This may reveal how 
changes in the attractor space correspond to heart rhythm changes, with the end goal 
being able to predict the onset of VF, thus improving available therapy and preven- 
tion. 
Abstract. When evaluating association rules, rules that differ in both 
support and confidence have to be compared; a larger support has to be 
traded against a higher confidence. The solution which we propose for 
this problem is to maximize the expected accuracy that the association 
rule will have for future data. In a Bayesian framework, we determine 
the contributions of confidence and support to the expected accuracy on 
future data. We present a fast algorithm that finds the n best rules which 
maximize the resulting criterion. The algorithm dynamically prunes re- 
dundant rules and parts of the hypothesis space that cannot contain 
better solutions than the best ones found so far. We evaluate the perfor- 
mance of the algorithm (relative to the Apriori algorithm) on realistic 
knowledge discovery problems. 



1 Introduction 

Association rules {e.g., HEE]), express regularities between sets of data items 
in a database. [Beer and TV magazine => chips] is an example of an association 
rule and expresses that, in a particular store, all customers who buy beer and a 
TV magazine are also likely to buy chips. In contrast to classifiers, association 
rules do not make a prediction for all database records. When a customer does 
not buy beer and a magazine, then our example rule does not conjecture that 
he will not buy chips either. The number of database records for which a rule 
does predict the proper value of an attribute is called the support of that rule. 

Association rules may not be perfectly accurate. The fraction of database 
records for which the rule conjectures a correct attribute value, relative to the 
fraction of records for which it makes any prediction, is called the confidence. 
Note that the confidence is the relative frequency of a correct prediction on the 
data that is used for training. We expect the confidence (or accuracy) on unseen 
data to lie below that on average, in particular, when the support is small. 

When deciding which rules to return, association rule algorithms need to 
take both confidence and support into account. Of course, we can find any num- 
ber of rules with perfectly high confidence but support of only one or very few 
records. On the other hand, we can construct very general rules with large 
support but low confidence. The Apriori algorithm [2] possesses confidence and 
support thresholds and returns all rules which lie above these bounds. However, 
a knowledge discovery system has to evaluate the interestingness of these rules 
and provide the user with a reasonable number of interesting rules. 

Which rules are interesting to the user depends on the problem which the 
user wants to solve with the help of the rules. In many cases, the user 
will be interested in finding items that do not only happen to co-occur in the 
available data. He or she will rather be interested in finding items between which 
there is a connection in the underlying reality. Items that truly correlate, will 
most likely also correlate in future data. In statistics, confidence intervals (which 
bound the difference between relative frequencies and their probabilities) can be 
used to derive guarantees that empirical observations reflect existing regularities 
in the underlying reality, rather than occurring just by chance. The number of 
observations plays a crucial role; when a rule has a large support, then we can be 
much more certain that the observed confidence is close to the confidence that 
we can expect to see in the future. This is one reason why association rules with 
very small support are considered less interesting. 

In this paper, we propose a trade-off between confidence and support which 
is in a way optimal by maximizing the chance of correct predictions on unseen 
data. We make the problem setting concrete in Sect. 2, and in Sect. 3 we present our 
resulting utility criterion. In Sect. 4 we present a fast algorithm that finds the n 
best association rules with respect to this criterion. We discuss the algorithm’s 
mechanism for pruning regions of the hypothesis space that cannot contain 
solutions that are better than the ones found so far, as well as the technique used 
to delete redundant rules which are already implied by other rules. In Sect. 5 
we evaluate our algorithm empirically. Section 6 concludes. 

2 Preliminaries 

Let D be a database consisting of one table over binary attributes $a_1, \ldots, a_k$, 
called items. In general, D has been generated by discretizing the attributes 
of a relation of an original database D'. For instance, when D' contains an 
attribute income, then D may contain binary attributes 0 < income < 20k, 
20k < income < 40k, and so on. A database record $r \subseteq \{a_1, \ldots, a_k\}$ is the set of 
attributes that take value one in a given row of the table D. 

A database record r satisfies an item set $x \subseteq \{a_1, \ldots, a_k\}$ if $x \subseteq r$. The 
support s(x) of an item set x is the number of records in D which satisfy x. 
Often, the fraction of records in D that satisfy x is called the support of x. 
But since the database D is constant, these terms are equivalent. 

An association rule $[x \Rightarrow y]$ with $x, y \subseteq \{a_1, \ldots, a_k\}$, $y \neq \emptyset$, and $x \cap y = \emptyset$ 
expresses a relationship between an item set x and a nonempty item set y. The 
intuitive semantics of the rule is that all records which satisfy x are predicted to 
also satisfy y. The confidence of the rule with respect to the (training) database 
D is $\hat{c}([x \Rightarrow y]) = \frac{s(x \cup y)}{s(x)}$, that is, the ratio of correct predictions over all records 
for which a prediction is made. 
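
Support and confidence as defined above are easy to compute on a toy database. The sketch below represents each record as a frozen set of item names; the function names and the example items are illustrative only, and the confidence is taken as the support of the body and head together divided by the support of the body.

```python
from typing import FrozenSet, List, Set

Record = FrozenSet[str]


def support(database: List[Record], items: Set[str]) -> int:
    """s(x): number of records r with x a subset of r."""
    return sum(1 for record in database if items <= record)


def confidence(database: List[Record], body: Set[str], head: Set[str]) -> float:
    """Observed confidence of [body => head]: s(body united with head) / s(body)."""
    if not head or body & head:
        raise ValueError("head must be nonempty and disjoint from body")
    # Raises ZeroDivisionError when no record satisfies the body.
    return support(database, body | head) / support(database, body)
```

For a database in which two records contain beer and a TV magazine and one of those also contains chips, the rule [beer and TV magazine => chips] has body support 2 and confidence 0.5.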

The confidence is measured with respect to the database D that is used for 
training. Often, a user will assume that the resulting association rules provide 
information on the process that generated the database which will be valid in 
the future, too. But the confidence on the training data is only an estimate of the 
rules’ accuracy in the future, and since we search the space of association rules 
to maximize the confidence, the estimate is optimistically biased. We define the 
predictive accuracy $c([x \Rightarrow y])$ of a rule as the probability of a correct prediction 
with respect to the process underlying the database. 

Definition 1. Let D be a database the records r of which are generated by a 
static process P, let $[x \Rightarrow y]$ be an association rule. The predictive accuracy 
$c([x \Rightarrow y]) = \Pr[r \text{ satisfies } y \mid r \text{ satisfies } x]$ is the conditional probability of $y \subseteq r$ 
given that $x \subseteq r$ when the distribution of r is governed by P. 

The confidence $\hat{c}([x \Rightarrow y])$ is the relative frequency corresponding to the probability 
$c([x \Rightarrow y])$ for the given database D. We now pose the n most accurate association rules problem. 

Definition 2. Given a database D (defined as above) and a set of database 
items $a_1$ through $a_k$, find n rules $h_1, \ldots, h_n \in \{[x \Rightarrow y] \mid x, y \subseteq \{a_1, \ldots, a_k\};\ y \neq \emptyset;\ x \cap y = \emptyset\}$ 
which maximize the expected predictive accuracy $c([x \Rightarrow y])$. 

We formulate the problem such that the algorithm needs to return a fixed 
number of best association rules rather than all rules the utility of which 
exceeds a given threshold. We think that this setting is more appropriate in many 
situations because a threshold may not be easy to specify and a user may not be 
satisfied with either an empty or an outrageously large set of rules. 

3 Bayesian Frequency Correction 

In this section, we analyze how confidence and support contribute to the pre- 
dictive accuracy. The intuitive idea is that we “mistrust” the confidence a little. 
How strongly we have to discount the confidence depends on the support - the 
greater the support, the more closely does the confidence relate to the predictive 
accuracy. In the Bayesian framework that we adopt, there is an exact solution as 
to how much we have to discount the confidence. We call this approach Bayesian 
frequency correction since the resulting formula (Equation 6) takes a confidence 
and “corrects” it by returning a somewhat lower predictive accuracy. 

Suppose that we have a given association rule $[x \Rightarrow y]$ with observed 
confidence $\hat{c}([x \Rightarrow y])$. We can read $P(c([x \Rightarrow y]) \mid \hat{c}([x \Rightarrow y]), s(x))$ as “P(predictive 
accuracy of $[x \Rightarrow y]$ given confidence of $[x \Rightarrow y]$ and support of x)”. The intuition 
of our analysis is that application of Bayes’ rule implies “P(predictive accuracy 
given confidence and support) = P(confidence given predictive accuracy and 
support) P(predictive accuracy) / normalization constant”. Note that the 
likelihood $P(\hat{c} \mid c, s)$ is simply the binomial distribution. (The target attributes of each 
record that satisfies x can be classified correctly or erroneously; the chance 
of a correct prediction is just the predictive accuracy c; this leads to a binomial 
distribution.) “P(predictive accuracy)”, the prior in our equation, is the 
accuracy histogram over the space of all association rules. This histogram counts, for 
every accuracy c, the fraction of rules which possess that accuracy. 
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In Equation we decompose the expectation by integrating over all pos- 
sible values of c([a; => y]). In Equation |3 we apply Bayes’ rule. 7r(c) = 
I { I )-c} | accuracy histogram. It specifies the probability of draw- 

ing an association rule with accuracy c when drawing at random under uniform 
distribution from the space of association rules of length up to A:. 

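$$E\big(c([x \Rightarrow y]) \mid \hat{c}([x \Rightarrow y]), s(x)\big) = \int c \, p\big(c([x \Rightarrow y]) = c \mid \hat{c}([x \Rightarrow y]), s(x)\big) \, dc \tag{1}$$

$$= \int c \, \frac{P\big(\hat{c}([x \Rightarrow y]) \mid c([x \Rightarrow y]) = c, s(x)\big) \, \pi(c)}{P\big(\hat{c}([x \Rightarrow y]) \mid s(x)\big)} \, dc \tag{2}$$

In Equation 4 we apply Equation 2. Since, over all $c$, the distribution
$p(c([x \Rightarrow y]) = c \mid \hat{c}([x \Rightarrow y]), s(x))$ has to
integrate to one (Equation 3), we can treat
$P(\hat{c}([x \Rightarrow y]) \mid s(x))$ as a normalizing constant which we can
determine uniquely in Equation 5.

$$\int P\big(c([x \Rightarrow y]) = c \mid \hat{c}([x \Rightarrow y]), s(x)\big) \, dc = 1 \tag{3}$$

$$\Leftrightarrow \int \frac{P\big(\hat{c}([x \Rightarrow y]) \mid c([x \Rightarrow y]) = c, s(x)\big) \, \pi(c)}{P\big(\hat{c}([x \Rightarrow y]) \mid s(x)\big)} \, dc = 1 \tag{4}$$

$$\Leftrightarrow P\big(\hat{c}([x \Rightarrow y]) \mid s(x)\big) = \int P\big(\hat{c}([x \Rightarrow y]) \mid c([x \Rightarrow y]) = c, s(x)\big) \, \pi(c) \, dc \tag{5}$$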



Combining Equations 2 and 5 we obtain Equation 6. In this equation, we also
state that, when the accuracy $c$ is given, the confidence $\hat{c}$ is governed
by the binomial distribution which we write as $B[c, s](\hat{c})$. This requires
us to make the standard assumption of independent and identically distributed
instances.



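$$E\big(c([x \Rightarrow y]) \mid \hat{c}([x \Rightarrow y]), s(x)\big) = \frac{\int c \, B[c, s(x)]\big(\hat{c}([x \Rightarrow y])\big) \, \pi(c) \, dc}{\int B[c, s(x)]\big(\hat{c}([x \Rightarrow y])\big) \, \pi(c) \, dc} \tag{6}$$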



We have now found a solution that quantifies
$E(c([x \Rightarrow y]) \mid \hat{c}([x \Rightarrow y]), s(x))$, the exact
expected predictive accuracy of an association rule $[x \Rightarrow y]$ with
given confidence $\hat{c}$ and body support $s(x)$. Equation 6 thus quantifies
just how strongly the confidence of a rule has to be corrected, given the
support of that rule. Note that the solution depends on the prior $\pi(c)$,
which is the histogram of accuracies of all association rules over the given
items for the given database.
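
As a hedged illustration of Equation 6, the following Python sketch evaluates the correction numerically for a prior that is given on a discrete grid. The function name `expected_accuracy`, the representation of the prior as (accuracy, weight) pairs, and the uniform prior in the usage lines are assumptions made for this example only. The binomial coefficient is left out because it cancels between numerator and denominator.

```python
import math

def expected_accuracy(confidence, support, prior):
    """Posterior mean of the accuracy c under a discretized prior.

    `prior` is a list of (accuracy, weight) pairs; this grid form is an
    assumption of the sketch, not something prescribed by the text.
    """
    hits = round(confidence * support)   # records predicted correctly
    misses = support - hits
    log_terms = []
    for c, weight in prior:
        if weight <= 0.0:
            continue
        # Skip grid points at which the observation has zero likelihood.
        if (c <= 0.0 and hits > 0) or (c >= 1.0 and misses > 0):
            continue
        log_lik = 0.0
        if hits > 0:
            log_lik += hits * math.log(c)
        if misses > 0:
            log_lik += misses * math.log(1.0 - c)
        log_terms.append((c, math.log(weight) + log_lik))
    if not log_terms:
        raise ValueError("prior gives zero probability to the observation")
    peak = max(t for _, t in log_terms)  # rescale to avoid underflow
    numerator = sum(c * math.exp(t - peak) for c, t in log_terms)
    denominator = sum(math.exp(t - peak) for _, t in log_terms)
    return numerator / denominator

# Uniform prior on a grid of 101 accuracies (illustrative only).
uniform = [(i / 100, 1.0 / 101) for i in range(101)]
low_support = expected_accuracy(1.0, 10, uniform)
high_support = expected_accuracy(1.0, 1000, uniform)
```

Under this assumed uniform prior, a rule with confidence 1 receives a higher corrected accuracy at body support 1000 than at body support 10, and both values stay below 1.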

One way of treating such priors is to assume a certain standard distribution.
Under a set of assumptions on the process that generated the database, $\pi(c)$
can be shown to be governed by a certain binomial distribution [?]. However,
empirical studies (see Sect. 5 and Fig. 2a) show that the shape of the prior can
deviate strongly from this binomial distribution. Reasonably accurate estimates
can be obtained by following a Markov Chain Monte Carlo [?] approach to
estimating the prior, using the available database (see Sect. 4). For an
extended discussion of the complexity of estimating this distribution, see [?].

Fig. 1. Contributions of support $s(x)$ and confidence $\hat{c}([x \Rightarrow y])$ to predictive accuracy
$c([x \Rightarrow y])$ of rule $[x \Rightarrow y]$



Example Curve. Figure 1 shows how expected predictive accuracy, confidence,
and body support relate for the database that we also use for our experiments in
Sect. 5, using 10 items. The predictive accuracy grows with both confidence and
body support of the rule. When the confidence exceeds 0.5, the predictive
accuracy is lower than the confidence, depending on the support and on the
histogram $\pi$ of accuracies of association rules for this database.

4 Discovery of Association Rules 

The Apriori algorithm [?] finds association rules in two steps. First, all item
sets $x$ with support of more than the fixed threshold “minsup” are found. Then,
all item sets are split into left and right hand side $x$ and $y$ (in all
possible ways) and the confidence of the rules $[x \Rightarrow y]$ is calculated
as $\hat{c}([x \Rightarrow y]) = s(x \cup y) / s(x)$. All rules with a
confidence above the confidence threshold “minconf” are returned. Our
algorithm differs from that scheme since we do not have fixed confidence and
support thresholds. Instead, we want to find the $n$ best rules.
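
To make the two quantities concrete, here is a small Python sketch of support (as an absolute count) and confidence over a toy transaction list. The helper names and the toy data are hypothetical and serve only as an illustration.

```python
def support(itemset, transactions):
    """Number of transactions that contain every item of `itemset`."""
    return sum(1 for t in transactions if itemset <= t)

def confidence(body, head, transactions):
    """s(body ∪ head) / s(body), or None when the body never occurs."""
    s_body = support(body, transactions)
    if s_body == 0:
        return None
    return support(body | head, transactions) / s_body

# Toy database of four transactions (made up for this example).
transactions = [frozenset(t) for t in (["a", "b"], ["a", "b", "c"], ["a", "c"], ["b"])]
conf_a_b = confidence(frozenset({"a"}), frozenset({"b"}), transactions)
```

In this toy database the body {a} occurs in three transactions and {a, b} in two of them, so the rule [a ⇒ b] has confidence 2/3.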

In the first step, our algorithm estimates the prior $\pi(c)$. Then generation
of frequent item sets, pruning the hypothesis space by dynamically adjusting the
minsup threshold, generating association rules, and removing redundant
association rules interleave. The algorithm is displayed in Table 1.

Table 1. Algorithm PredictiveApriori: discovery of n most predictive association rules

1. Input: $n$ (desired number of association rules), database with items $a_1, \ldots, a_k$.
2. Let $\tau = 1$.
3. For $i = 1 \ldots k$ Do: Draw a number of association rules $[x \Rightarrow y]$ with $i$ items at
   random. Measure their confidence (provided $s(x) > 0$). Let $\pi_i(c)$ be the distribution
   of confidences.
4. For all $c$, Let $\pi(c) = \frac{\sum_{i=1}^{k} \pi_i(c) \binom{k}{i} (2^i - 1)}{\sum_{i=1}^{k} \binom{k}{i} (2^i - 1)}$.
5. Let $X_0 = \{\emptyset\}$; Let $X_1 = \{\{a_1\}, \ldots, \{a_k\}\}$ be all item sets with one single element.
6. For $i = 1 \ldots k - 1$ While ($i = 1$ or $X_{i-1} \neq \emptyset$).
   a) If $i > 1$ Then determine the set of candidate item sets of length $i$ as
      $X_i = \{x \cup x' \mid x, x' \in X_{i-1}, |x \cup x'| = i\}$. Generation of $X_i$ can be optimized by
      considering only item sets $x$ and $x' \in X_{i-1}$ that differ only in the element
      with highest item index. Eliminate double occurrences of item sets in $X_i$.
   b) Run a database pass and determine the support of the generated item sets.
      Eliminate item sets with support less than $\tau$ from $X_i$.
   c) For all $x \in X_i$ Call RuleGen($x$).
   d) If best has been changed, Then Increase $\tau$ to be the smallest number such
      that $E(c \mid 1, \tau) > E(c(best[n]) \mid \hat{c}(best[n]), s(best[n]))$ (refer to Equation 6).
      If $\tau >$ database size, Then Exit.
   e) If $\tau$ has been increased in the last step, Then eliminate all item sets from $X_i$
      which have support below $\tau$.
7. Output $best[1] \ldots best[n]$, the list of the $n$ best association rules.

Algorithm RuleGen($x$) (generate all rules with body $x$)

10. Let $\gamma$ be the smallest number such that
    $E(c \mid \gamma / s(x), s(x)) > E(c(best[n]) \mid \hat{c}(best[n]), s(best[n]))$.
11. For $i = 1 \ldots k$ With $a_i \notin x$ Do (for all items not in $x$)
    a) If $i = 1$ Then Let $Y_1 = \{\{a_i\} \mid a_i \notin x\}$ (item sets with one element not in $x$).
    b) Else Let $Y_i = \{y \cup y' \mid y, y' \in Y_{i-1}, |y \cup y'| = i\}$ analogous to the generation of
       candidates in step 6(a).
    c) For all $y \in Y_i$ Do
       i. Measure the support $s(x \cup y)$. If $s(x \cup y) < \gamma$, Then eliminate $y$ from $Y_i$
          and Continue the for loop with the next $y$.
       ii. Equation 6 gives the predictive accuracy $E(c([x \Rightarrow y]) \mid s(x \cup y) / s(x), s(x))$.
       iii. If the predictive accuracy is among the $n$ best found so far (recorded
            in best), Then update best, remove rules in best that are subsumed
            by other rules, and Increase $\gamma$ to be the smallest number such that
            $E(c \mid \gamma / s(x), s(x)) > E(c(best[n]) \mid \hat{c}(best[n]), s(best[n]))$.
12. If any subsumed rule has been erased in step 11(c)iii, Then recur from step 10.

Estimating $\pi(c)$. We can estimate $\pi$ by drawing many hypotheses at random
under uniform distribution, measuring their confidence, and recording the
resulting histogram. Algorithmically, we run a loop over the length of the rule
and, given that length, draw a fixed number of rules. We determine the items
and the split into body and head by drawing at random (step 3). We have now
drawn equally many rules for each size while the uniform distribution requires
us to prefer long rules as there are many more long rules than there are short
ones. There are $\binom{k}{i}$ item sets of size $i$ over $k$ database items, and given $i$ items,
there are $2^i - 1$ distinct association rules (each item can be located on the left or
right hand side of the rule but the right hand side must be nonempty). Hence,
Equation 7 gives the probability that exactly $i$ items occur in a rule which is
drawn at random under uniform distribution from the space of all association
rules over $k$ items.



$$P[i \text{ items}] = \frac{\binom{k}{i} (2^i - 1)}{\sum_{j=1}^{k} \binom{k}{j} (2^j - 1)} \tag{7}$$

In step 4 we apply a Markov Chain Monte Carlo style correction to the prior by
weighting each prior for rule length $i$ by the probability of a rule length of $i$.
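
The weighting can be sketched in a few lines of Python. The sketch assumes that each per-length prior is stored as a list of histogram bin weights; the function names `length_probabilities` and `mix_priors` are hypothetical.

```python
from math import comb

def length_probabilities(k):
    """P[i items] for i = 1..k, following the counting argument of Equation 7."""
    counts = [comb(k, i) * (2**i - 1) for i in range(1, k + 1)]
    total = sum(counts)
    return [n / total for n in counts]

def mix_priors(per_length_priors):
    """Combine per-length histograms into one prior.

    per_length_priors[i - 1] holds the bin weights of the histogram measured
    for rules with i items (an assumed storage format).
    """
    weights = length_probabilities(len(per_length_priors))
    bins = len(per_length_priors[0])
    return [sum(w * p[b] for w, p in zip(weights, per_length_priors))
            for b in range(bins)]

probs_two_items = length_probabilities(2)
```

For k = 2 there are 2 rules with one item and 3 rules with two items, so the two length probabilities are 2/5 and 3/5.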



Enumerating Item Sets with Dynamic Minsup Threshold. Similarly to
the Apriori algorithm, the PredictiveApriori algorithm generates frequent item
sets, but using a dynamically increasing minsup threshold $\tau$. Note that we start
with size zero (only the empty item set is contained in $X_0$). $X_1$ contains all
item sets with one element. Given $X_{i-1}$, the algorithm computes $X_i$ in step 6(a)
just like Apriori does. An item set can only be frequent when all its subsets are
frequent, too. We can thus generate $X_i$ by only joining those elements of $X_{i-1}$
which differ exactly in the last element (where last refers to the highest item
index). Since all subsets of an element of $X_i$ must be in $X_{i-1}$, the subsets that
result from removing the last, or the last but one element must be in $X_{i-1}$, too.
After running a database pass and measuring the support of each element of $X_i$,
we can delete all those candidates that do not achieve the required support of $\tau$.
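
The candidate generation described above can be sketched as follows. Item sets are represented as sorted tuples of item indices, which is an assumption of this example; `generate_candidates` is a hypothetical name, and the subset check is the standard Apriori pruning step.

```python
from itertools import combinations

def generate_candidates(prev_level):
    """Join item sets of length i-1 that differ only in their last element.

    `prev_level` is a collection of sorted tuples, all of the same length.
    A candidate is kept only if all of its (i-1)-subsets are in `prev_level`.
    """
    prev = set(prev_level)
    candidates = set()
    for x, y in combinations(sorted(prev), 2):
        if x[:-1] == y[:-1]:                # same prefix, different last item
            cand = x + (y[-1],)             # stays sorted because x < y
            if all(cand[:j] + cand[j + 1:] in prev for j in range(len(cand))):
                candidates.add(cand)
    return candidates

level_2 = {(1, 2), (1, 3), (2, 3), (1, 4)}
level_3 = generate_candidates(level_2)
```

Here (1, 2, 3) survives because all of its two-element subsets are present, while (1, 2, 4) and (1, 3, 4) are discarded because (2, 4) and (3, 4) are missing.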

We then call the RuleGen procedure in step 6(c) that generates all rules over
body $x$, for each $x \in X_i$. The RuleGen procedure alters our array $best[1 \ldots n]$
which saves the best rules found so far. In step 6(d) we refer to $best[n]$, meaning
the $n$th best rule found so far. We now refer to Equation 6 again to determine
the least support that the body of an association rule with perfect confidence
must possess in order to exceed the predictive accuracy of the currently $n$th best
rule. If that required support exceeds the database size we can exit because no
such rule can exist. We delete all item sets in step 6(e) which lie below that new
$\tau$. Finally, we output the $n$ best rules in step 7.
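
The search for the new threshold can be sketched as a simple scan over candidate supports. In the sketch below, `accuracy_at_full_confidence` stands in for Equation 6 evaluated at confidence 1; the closed form `(s + 1) / (s + 2)` used in the usage lines is the posterior mean under an assumed uniform prior and is only a placeholder for the estimated prior.

```python
def min_body_support(threshold_accuracy, database_size, accuracy_at_full_confidence):
    """Smallest support tau whose corrected accuracy at confidence 1 exceeds
    the threshold, or None if no support up to the database size suffices."""
    for tau in range(1, database_size + 1):
        if accuracy_at_full_confidence(tau) > threshold_accuracy:
            return tau
    return None

# Placeholder for Equation 6 under an assumed uniform prior.
def uniform_prior_accuracy(s):
    return (s + 1) / (s + 2)

tau = min_body_support(0.93, 10000, uniform_prior_accuracy)
unreachable = min_body_support(0.999, 10, uniform_prior_accuracy)
```

A return value of `None` corresponds to the exit condition: no rule, even with perfect confidence, can beat the current $n$th best rule.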



Generating All Rules over Given Body x. In step 10, we introduce a new
accuracy threshold $\gamma$ which quantifies the confidence that a rule with support
$s(x)$ needs in order to be among the $n$ best ones. We then start enumerating all
possible heads $y$, taking into account in step 11 that body and head must be
disjoint and generating candidates in step 11(b) analogous to step 6(a). In step
11(c)i we calculate the support of $x \cup y$ for all heads $y$. When a rule lies among
the best ones so far, we update best. We will not bother with rules that have
a predictive accuracy below the accuracy of $best[n]$, so we increase $\gamma$. In step
11(c)iii we delete rules from best which are subsumed by other rules. This may
result in the unfortunate fact that rules which we dropped from best earlier now
belong to the $n$ best rules again. So in step 12 we have to check this and
recur from step 10 if necessary.

Removing Redundant Rules. Consider an association rule $[a \Rightarrow c, d]$. When
this rule is satisfied by a database, then that database must also satisfy
$[a, b \Rightarrow c, d]$, $[a \Rightarrow c]$, $[a \Rightarrow d]$, and many other rules. We write
$[x \Rightarrow y] \models [x' \Rightarrow y']$ to express that any database that satisfies
$[x \Rightarrow y]$ must also satisfy $[x' \Rightarrow y']$. Since
we can generate exponentially many redundant rules that can be inferred from
a more general rule, it is not desirable to present all these redundant rules to
the user. Consider the example in Table 2 which shows the five most interesting
rules generated by PredictiveApriori for the purchase database that we study
in Sect. 5. The first and second rule in the bottom part are special cases of the
third rule; the fourth and fifth rules are subsumed by the second rule of the top
part. The top part shows the best rules with redundant variants removed.

Theorem 1. We can decide whether a rule subsumes another rule by two simple
subset tests: $[x \Rightarrow y] \models [x' \Rightarrow y'] \Leftrightarrow x \subseteq x' \wedge y \supseteq y'$. Moreover, if $[x \Rightarrow y]$ is
supported by a database $D$, and $[x \Rightarrow y] \models [x' \Rightarrow y']$, then this database also
supports $[x' \Rightarrow y']$.

Proofs of Theorems 1 and 2 are left for the full paper. Theorem 1 says that
$[x \Rightarrow y]$ subsumes $[x' \Rightarrow y']$ if and only if $x$ is a subset of $x'$ (weaker precondition)
and $y$ is a superset of $y'$ ($y$ predicts more attribute values than $y'$). We can then
delete $[x' \Rightarrow y']$ because Theorem 1 says that from a more general rule we can
infer that all subsumed rules must be satisfied, too. In order to assure that the $n$
rules which the user is provided are not redundant specializations of each other,
we test for subsumption in step 11(c)iii by performing the two subset tests that
imply subsumption according to Theorem 1.
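
The two subset tests of Theorem 1 translate directly into code. The sketch below represents a rule as a pair of frozensets (body, head); this representation and the names `subsumes` and `remove_subsumed` are assumptions of the example, and the filter applies only the subset tests, without looking at accuracies.

```python
def subsumes(rule, other):
    """True if `rule` = (x, y) subsumes `other` = (x2, y2): x ⊆ x2 and y ⊇ y2."""
    (x, y), (x2, y2) = rule, other
    return x <= x2 and y >= y2

def remove_subsumed(rules):
    """Keep only the rules that are not subsumed by a different rule in the list."""
    return [r for r in rules
            if not any(q != r and subsumes(q, r) for q in rules)]

empty = frozenset()
rules = [
    (empty, frozenset({"PanelID=9"})),
    (empty, frozenset({"ProductGroup=84"})),
    (empty, frozenset({"PanelID=9", "ProductGroup=84"})),
    (frozenset({"Location=market_4"}), frozenset({"PanelID=9"})),
]
kept = remove_subsumed(rules)
```

Of these four rules only the third survives, since its empty body is a subset of every other body and its head contains every other head.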

Theorem 2. The PredictiveApriori algorithm (Table 1) returns $n$ association
rules $[x_i \Rightarrow y_i]$ with the following properties. (i) For all returned solutions
$[x \Rightarrow y]$, $[x' \Rightarrow y']$: $[x \Rightarrow y] \not\models [x' \Rightarrow y']$. (ii) Subject to constraint (i), the returned
rules maximize $E(c([x_i \Rightarrow y_i]) \mid \hat{c}([x_i \Rightarrow y_i]), s(x_i))$ according to Equation 6.



Improvements. Several improvements of the Apriori algorithm have been
suggested that improve the PredictiveApriori algorithm as well. The AprioriTid
algorithm requires far fewer database passes by storing, for each database
record, a list of item sets of length $i$ which this record supports. From these
lists, the support of each item set can easily be computed. In the next
iteration, the list of item sets of length $i + 1$ that each transaction supports can
be computed without accessing the database. We can expect this modification
to enhance the overall performance when the database is very large but sparse.
Further improvements can be obtained by using sampling techniques (e.g., [?]).

Table 2. (Top) five best association rules when subsumed rules are removed; (bottom)
five best rules when subsumed rules are not removed

[ $\Rightarrow$ PanelID=9, ProductGroup=84 ]
    $E(c \mid \hat{c} = 1, s = 10000) = 1$
[ Location=market_4 $\Rightarrow$ PanelID=9, ProductGroup=84, Container=nonreuseable ]
    $E(c \mid \hat{c} = 1, s = 1410) = 1$
[ Location=market_6 $\Rightarrow$ PanelID=9, ProductGroup=84, Container=nonreuseable ]
    $E(c \mid \hat{c} = 1, s = 1193) = 1$
[ Location=market_1 $\Rightarrow$ PanelID=9, ProductGroup=84, Container=nonreuseable ]
    $E(c \mid \hat{c} = 1, s = 1025) = 1$
[ Manufacturer=producer_18 $\Rightarrow$ PanelID=9, ProductGroup=84, Type=0, Container=nonreuseable ]
    $E(c \mid \hat{c} = 1, s = 1804) = 1$

[ $\Rightarrow$ PanelID=9 ]
    $E(c \mid \hat{c} = 1, s = 10000) = 1$
[ $\Rightarrow$ ProductGroup=84 ]
    $E(c \mid \hat{c} = 1, s = 10000) = 1$
[ $\Rightarrow$ PanelID=9, ProductGroup=84 ]
    $E(c \mid \hat{c} = 1, s = 10000) = 1$
[ Location=market_4 $\Rightarrow$ PanelID=9 ]
    $E(c \mid \hat{c} = 1, s = 1410) = 1$
[ Location=market_4 $\Rightarrow$ ProductGroup=84 ]
    $E(c \mid \hat{c} = 1, s = 1410) = 1$



5 Experiments 

For our experiments, we used a database of 14,000 fruit juice purchase
transactions, and the mailing campaign data used for the KDD cup 1998. Each
transaction of the fruit juice database is described by 29 real-valued and
string-valued attributes which specify properties of the purchased juice as well
as attributes of the customer (e.g., age and job). By binarizing the attributes
and considering only a subset of the binary attributes, we varied the number of
items during the experiments. For instance, we transformed the attribute
“ContainerSize” into five binary attributes, “ContainerSize < 0.3”,
“0.3 < ContainerSize < 0.5”, etc.
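
A binarization of this kind might look as follows in Python. Only the first two cut points (0.3 and 0.5) appear in the text; the remaining cut points, the half-open interval convention, and the function name `binarize` are assumptions made for the sketch.

```python
def binarize(value, name, cut_points):
    """Map a numeric value to one binary item per interval.

    Intervals are half-open [lo, hi); this convention is an assumption.
    Returns a dict from item label to True/False.
    """
    edges = [float("-inf")] + list(cut_points) + [float("inf")]
    items = {}
    for lo, hi in zip(edges, edges[1:]):
        items[f"{lo} <= {name} < {hi}"] = lo <= value < hi
    return items

# Four cut points give five binary items; 1.0 and 1.5 are made-up values.
items = binarize(0.33, "ContainerSize", [0.3, 0.5, 1.0, 1.5])
```

Exactly one of the five resulting items is true for any given container size.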

Figure 2a shows the prior $\pi(c)$ as estimated by the algorithm in step 4 for
several numbers of items. Figure 1 shows the predictive accuracy for this prior,
depending on the confidence and the body support. Table 2 (top) shows the five
best association rules found for the fruit juice problem by the PredictiveApriori
algorithm. The rules say that all transactions are performed under PanelID 9
and refer to product group 84 (fruit juice purchases). Apparently, markets 1, 4,
and 6 only sell non-reusable bottles (in contrast to the refillable bottles sold by
most German supermarkets). Producer 18 does not sell refillable bottles either.

In order to compare the performance of Apriori and PredictiveApriori, we
need to find a uniform measure that is independent of implementation details.
For Apriori, we count how many association rules have to be compared against
the minconf threshold (this number is independent of the actual minconf
threshold). We can determine this number from the item sets without actually
enumerating all rules. For PredictiveApriori, we measure for how many rules we
need to determine the predictive accuracy by evaluating Equation 6.

Fig. 2. (a) Confidence prior $\pi$ for various numbers of items, (b) Number of rules that
PredictiveApriori has to consider dependent on the number $n$ of desired solutions

Fig. 3. Time complexity of PredictiveApriori and Apriori, depending on the number
of items and (in case of Apriori) of minsup (a) fruit juice problem, (b) KDD cup 1998



The performance of Apriori depends crucially on the choice of the support
threshold “minsup”. In Fig. 3 we compare the computational expenses imposed
by PredictiveApriori (10 best solutions) to the complexity of Apriori for several
different minsup thresholds and numbers of items for both the fruit juice and the
KDD cup database. The time required by Apriori grows rapidly with decreasing
minsup values. Among the 25 best solutions for the juice problem found by
PredictiveApriori we can see rules with body support and confidence of 92. In order
to find such special but accurate rules, Apriori would run many times as long as
PredictiveApriori. Figure 2b shows how the complexity increases with the
number of desired solutions. The increase is only sublinear. Figure 4 shows extended
comparisons of the Apriori and PredictiveApriori performance for the fruit juice
problem. The horizontal lines show the time required by PredictiveApriori for
the given number of database items ($n = 10$ best solutions). The curves show
how the time required by Apriori depends on the minsup support threshold.
Apriori is faster for large thresholds since it then searches only a small fraction
of the space of association rules.

Fig. 4. Number of rules that PredictiveApriori and Apriori need to consider,
depending on the number of items (in case of Apriori also depending on minsup)



6 Discussion and Related Results 

We discussed the problem of trading confidence of an association rule against
support. When the goal is to maximize the expected accuracy on future database
records that are generated by the same underlying process, then Equation 6 gives
us the optimal trade-off between confidence and support of the rule’s body.
Equation 6 results from a Bayesian analysis of the predictive accuracy; it is based
on the assumption that the database records are independent and identically
distributed and requires us to estimate the confidence prior. The PredictiveApriori
algorithm does this using a MCMC approach [?].

The Bayesian frequency correction approach that eliminates the optimistic
bias of high confidences relates to an analysis of classification algorithms [?]
that yields a parameter-free regularization criterion for decision tree algorithms
[?]. The PredictiveApriori algorithm returns the $n$ rules which maximize the
expected accuracy; the user only has to specify how many rules he or she wants
to be presented. This is perhaps a more natural parameter than minsup and
minconf, required by the Apriori algorithm.

The algorithm also checks the rules for redundancies. It has a bias towards
returning general rules and eliminating all rules which are entailed by equally
accurate, more general ones. Guided by similar ideas, the Midos algorithm [?]
performs a similarity test for hypotheses. In [?], a rule discovery algorithm
is discussed that selects from classes of redundant rules the most simple, rather
than the most general ones. For example, given two equally accurate rules $[a \Rightarrow b]$
and $[a \Rightarrow b, c]$, PredictiveApriori would prefer the latter which predicts more
values whereas [?] would prefer the shorter first one.

The favorable computational performance of the PredictiveApriori algorithm
can be credited to the dynamic pruning technique that uses an upper bound on
the accuracy of all rules over supersets of a given item set. Very large parts of
the search space can thus be excluded. A similar idea is realized in Midos [?].
Many optimizations of the Apriori algorithm have been proposed which have
helped this algorithm gain its huge practical relevance. These include the
AprioriTid approach for minimizing the number of database passes and sampling
approaches for estimating the support of item sets [?]. In particular, efficient
search for frequent itemsets has been addressed intensely and successfully [?].
Many of these improvements can, and should be, applied to the
PredictiveApriori algorithm as well.



Abstract. This paper presents a novel decision-tree induction for a
multi-objective data set, i.e. a data set with a multi-dimensional class. 
Inductive decision-tree learning is one of the frequently-used methods 
for a single-objective data set, i.e. a data set with a single-dimensional 
class. However, in a real data analysis, we usually have multiple ob- 
jectives, and a classifier which explains them simultaneously would be 
useful and would exhibit higher readability. A conventional decision-tree 
inducer requires transformation of a multi-dimensional class into a single- 
dimensional class, but such a transformation can considerably worsen 
both accuracy and readability. In order to circumvent this problem we 
propose a bloomy decision tree which deals with a multi-dimensional 
class without such transformations. A bloomy decision tree has a set of 
split nodes each of which splits examples according to their attribute 
values, and a set of flower nodes each of which predicts a class dimen- 
sion of examples. A flower node appears not only at the fringe of a tree 
but also inside a tree. Our pruning is executed during tree construction, 
and evaluates each class dimension based on Cramer’s V. The proposed 
method has been implemented as D3-B (Decision tree in Bloom), and 
tested with eleven data sets. The experiments showed that D3-B has 
higher accuracies in nine data sets than C4.5 and tied with it in the 
other two data sets. In terms of readability, D3-B has a smaller number 
of split nodes in all data sets, and thus outperforms C4.5. 



1 Introduction 

Given a set of training examples, learning from examples aims at constructing
a classifier which predicts the class of an unseen example. Here, learning from
examples assumes that each example has a single-dimensional class, and can thus
be called single-objective. Inductive decision-tree learning [?] has been
successfully used in various fields as one of the most useful methods in learning
from examples.

In dealing with real data, however, we often have multiple objectives, and may
wish to predict a multi-dimensional class [?]. Building a separate decision tree
for each objective would be problematic in terms of readability because decision
trees differ in their structures and in their split attributes, and are thus difficult
to compare. A single classifier which predicts this multi-dimensional class
would be more useful. For example, suppose an analyst is constructing a classifier
from an agricultural data set about various crops. Rather than having a decision
tree which predicts only corn, the analyst would prefer a decision tree which
predicts corn and wheat simultaneously since it would be more comprehensible.

In such a case, a conventional decision-tree learning algorithm can construct
a classifier if the multi-dimensional class is transformed into a single-dimensional
class. This idea is described briefly in [?] without experimental justification^1.
However, a transformation without loss of information, such as assigning a new
class value to each combination of class values, considerably increases the
number of class values. This tendency causes a fragmentation problem [?]: each
class value has only a few training examples in a split node at the bottom of
a decision tree, and appropriate selection of an attribute would be difficult. A
transformation with loss of information, such as principal component analysis [?],
could overlook useful knowledge.
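
The growth in the number of class values under a lossless transformation is easy to see in a small Python sketch. The crop names and yield levels below are made up for illustration, and `combined_class_values` is a hypothetical helper.

```python
from itertools import product

def combined_class_values(dimension_values):
    """One new class value for every combination of per-dimension values."""
    return ["|".join(combo) for combo in product(*dimension_values)]

# Three class dimensions with three values each (illustrative data).
dimensions = {
    "corn": ["low", "medium", "high"],
    "wheat": ["low", "medium", "high"],
    "rice": ["low", "medium", "high"],
}
combined = combined_class_values(dimensions.values())
```

Three dimensions with three values each already yield 27 combined class values, so the training examples available per class value shrink quickly as dimensions are added.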

In order to circumvent this problem we propose a bloomy decision tree in 
which each class dimension is independently predicted by a flower node. Since 
a flower node can be constructed inside a tree, the number of class dimensions
gradually decreases as an example descends the tree. This corresponds to coping
with the fragmentation problem by simplifying the classification task in order 
to construct a small decision tree with high accuracy. We have implemented an 
induction algorithm of bloomy decision trees as D3-B (Decision tree in Bloom), 
and demonstrate its effectiveness as a multi-objective classifier with eleven data 
sets. 

2 Decision Tree 

In this section, we give a simple explanation of inductive decision-tree learning
[?]. For various problems in this field, please refer to a recent survey [?].



2.1 Construction of a Decision Tree 

A decision tree represents a tree-structured classifier which consists of a set of 
nodes and a set of edges. A node is either a split node which tests an attribute 
or a leaf node which predicts a class of an example. Given an unseen example, 
a split node assigns the example to one of its subtrees according to the value of 
its attribute, and a leaf node predicts the class value of the example. 

The input to a decision-tree inducer is a set $E$ of examples. An example $e_i$
has $n$ attribute values $a_{1i}, a_{2i}, \ldots, a_{ni}$ for attributes $a_1, a_2, \ldots, a_n$ and a class
value $c_i$ for a class $c$, and is represented as $e_i = (a_{1i}, a_{2i}, \ldots, a_{ni}, c_i)$.

^1 Caruana formalized a learning problem with multiple objectives as multitask learning
[?], which includes our multi-objective classification. He almost exclusively worked
with neural networks, and only dropped a few remarks about decision trees. His
remarks mainly concern proposal of novel split criteria, and no proposals are given
for knowledge representation.

A decision tree is typically constructed by a recursive split of example space
with a divide and conquer method. The split typically employs greedy search,
and, for each split node, the best attribute is selected as the attribute of the
node based on an evaluation function. We will explain a typical function in
Sect. 2.2. The class value of a leaf node is determined if all training examples
in the node have the same class value. If a leaf node has no training examples,
the most frequent class value of examples in its parent is assigned as the class
value of the leaf. If the training examples in a split node can no longer be split,
the node becomes a leaf node and the most frequent class value of examples in
the node is assigned as the class value of the leaf. Here, a decision tree which
perfectly predicts the class of a training example tends to perform poorly for
test examples. A procedure called pruning replaces a subtree which is judged
irrelevant in the prediction with a leaf node. We will explain a pruning method
in Sect. 2.3.



2.2 Attribute Selection 



Here, we explain gain ratio [?] as one of the evaluation criteria. Gain ratio
$G(a, c, E)$ is a criterion based on mutual entropy of an attribute $a$ and the class
$c$ for a set $E$ of examples. Let $|E|$, $E_{a=i}$, and $E_{c=i}$ be the number of examples in
$E$, the set of examples each of which satisfies $a = i$ in $E$, and the set of examples
each of which satisfies $c = i$ in $E$ respectively, then

$$G(a, c, E) = \frac{H(c, E) - J(a, c, E)}{H(a, E)} \tag{1}$$

where

$$H(c, E) = -\sum_{i} \frac{|E_{c=i}|}{|E|} \log_2 \left( \frac{|E_{c=i}|}{|E|} \right) \tag{2}$$

$$J(a, c, E) = \sum_{i} \frac{|E_{a=i}|}{|E|} H(c, E_{a=i}) \tag{3}$$
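
A direct transcription of Equations 1 to 3 into Python might look as follows. The sketch takes the attribute and class columns as two parallel lists, which is an assumed data layout, and returns 0 when the attribute has a single value (a convention chosen here to avoid division by zero).

```python
from collections import Counter
from math import log2

def entropy(values):
    """H(v, E): entropy of the value distribution in a list of values."""
    n = len(values)
    return -sum((k / n) * log2(k / n) for k in Counter(values).values())

def gain_ratio(attr_values, class_values):
    """G(a, c, E) = (H(c, E) - J(a, c, E)) / H(a, E)."""
    n = len(attr_values)
    groups = {}
    for a, c in zip(attr_values, class_values):
        groups.setdefault(a, []).append(c)
    # J(a, c, E): class entropy within each attribute value, weighted by size.
    j = sum(len(g) / n * entropy(g) for g in groups.values())
    h_a = entropy(attr_values)
    if h_a == 0.0:
        return 0.0  # single-valued attribute; convention of this sketch
    return (entropy(class_values) - j) / h_a

informative = gain_ratio(["x", "x", "y", "y"], ["p", "p", "q", "q"])
uninformative = gain_ratio(["x", "y", "x", "y"], ["p", "p", "q", "q"])
```

An attribute that determines the class completely reaches a gain ratio of 1 in this balanced example, while an attribute that is independent of the class obtains 0.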



2.3 Pruning 



Pruning can be classified as either pre-pruning, which is executed during tree
construction, or post-pruning, which is executed after tree construction [?].
Compared with post-pruning, pre-pruning is time-efficient since it does not
require construction of a complete tree. However, experimental evidence shows
that pre-pruning leads to low accuracy [?], and most decision-tree inducers
employ post-pruning.

Pruning based on $\chi^2$ is employed in decision-tree inducers such as ID3 [?].
This method first calculates a value $\chi^2(a, c, E)$ of an attribute $a$ and the class
$c$ for a set $E$ of examples in a node. Let $E_{a=i,c=j}$ be the set of examples each of
which satisfies both $a = i$ and $c = j$ in $E$, then

$$\chi^2(a, c, E) = \sum_{i} \sum_{j} \frac{\left( |E_{a=i,c=j}| - e_{a=i,c=j}(E) \right)^2}{e_{a=i,c=j}(E)} \tag{4}$$

where

$$e_{a=i,c=j}(E) = \frac{|E_{a=i}| \, |E_{c=j}|}{|E|} \tag{5}$$



Let $\chi^2_r(\alpha)$ be a $\chi^2$ value with $r$ degrees of freedom and a significance level $\alpha$,
then when $\chi^2(a, c, E)$ is smaller than the threshold $\chi^2_r(\alpha)$, the attribute $a$ and the
class $c$ are considered to have no relevance, and the split node is replaced by a
leaf node. A shortcoming of this approach is that $\chi^2(a, c, E)$ tends to be overly
large when the number of examples in the training set is large, and a decision
tree is often under-pruned [?].
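
The dependence of the statistic on the sample size can be checked with a short Python sketch of Equation 4, using the usual expected count (row total times column total, divided by the number of examples) for Equation 5. The function name and the parallel-list data layout are assumptions of the example.

```python
from collections import Counter

def chi_square(attr_values, class_values):
    """Chi-square statistic between an attribute column and a class column."""
    n = len(attr_values)
    a_counts = Counter(attr_values)
    c_counts = Counter(class_values)
    observed = Counter(zip(attr_values, class_values))
    stat = 0.0
    for a, n_a in a_counts.items():
        for c, n_c in c_counts.items():
            expected = n_a * n_c / n          # expected count under independence
            stat += (observed[(a, c)] - expected) ** 2 / expected
    return stat

attr = ["x", "x", "y", "y"]
cls = ["p", "p", "q", "q"]
small = chi_square(attr, cls)            # 4 examples
large = chi_square(attr * 2, cls * 2)    # same proportions, 8 examples
```

Duplicating every example leaves the proportions unchanged but doubles the statistic, which illustrates why a fixed threshold prunes less on large training sets.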



3 Bloomy Decision Tree for Multi-objective Classification 

3.1 Bloomy Decision Tree 

In a data set $E$ for multi-objective classification, i.e. with a multi-dimensional
class, each example has $n$ attribute values $a_{1i}, a_{2i}, \ldots, a_{ni}$ for attributes
$a_1, a_2, \ldots, a_n$, and $m$ class values $(c_{1i}, c_{2i}, \ldots, c_{mi})$ for an $m$-dimensional class
$(c_1, c_2, \ldots, c_m)$.

The problems in Sect. 1 have led us to invent a bloomy decision tree for
multi-objective classification. In a decision tree for multi-objective classification,
several class dimensions can be predicted accurately even at an internal node.
A bloomy decision tree predicts such dimensions in an internal node which is
called a flower node, and typically reduces the number of class dimensions to be
predicted as an example descends the tree from its root to one of its leaves. This
corresponds to simplifying a multi-objective classification down a decision
tree, and can be considered as an efficient solution to the fragmentation problem
described in Sect. 1. A bloomy decision tree is expected to show high accuracy
with a simpler structure compared with a conventional decision tree. A flower
node corresponds to a leaf node which predicts a set of class dimensions, and
appears not only at the fringe of a tree but also inside a tree.

Similar to a conventional decision tree, a bloomy decision tree $T$ has a
recursive tree structure. A node $N$ of a bloomy decision tree $T$ is classified as either a
flower node $N_{bloom}$ which predicts values of a set of class dimensions, or a split
node $N_{split}$ which splits examples according to their attribute values. Figure 1
shows an example of a bloomy decision tree for a 2-dimensional class $(c_1, c_2)$,
where an oval and a rectangle represent a flower node and a split node
respectively. Note that a flower node appears inside the tree since a class dimension is
predicted as $c_1 = p$ for examples each of which satisfies Attribute1 = Y.

A root node of a bloomy decision tree is a split node. A split node $N_{split}$ has
an attribute $a$ which is selected according to a procedure in the next section. Let
the number of values for an attribute $a$ be $v_a$, then there are $v_a$ child nodes for
$N_{split}$, and each child node is assigned a subset $E_{a=a_i}$ where $a_i$ is the $i$-th value
of $a$. A child node of $N_{split}$ is either a split node or a flower node.

A flower node $N_{bloom}$ consists of $l$ ($\leq m$) petals $p_1, p_2, \ldots, p_l$ each of which
predicts a class dimension. Alternatively, a petal $p_i$ represents that a predicted
value is assigned to a class dimension $c_j$. The predicted value of $c_j$ is fixed in
the petal and remains unchanged in the child nodes of $N_{bloom}$. If some of the
class dimensions remain unpredicted in $N_{bloom}$, $N_{bloom}$ has a child node which
is a split node. Note that a flower node can be an internal node as well as a leaf
node.

Fig. 1. Example of a bloomy decision tree

3.2 Attribute Selection 

Similar to the construction of a decision tree, we employ a divide and conquer
method based on an attribute selection function F(a, E). Gain ratio presented in
Sect. 2.2 is an evaluation function for a single-dimensional class, and cannot be
employed without modification. In this paper, we employ the add-sum F(a, E)
of gain ratio $G(a, c_j, E)$ for each class dimension $c_j$.

$$F(a, E) = \sum_{j=1}^{m} G(a, c_j, E) \tag{6}$$

Given a set E of training examples, the attribute a which maximizes F(a, E) is
selected in a split node. We call this approach the add-sum criterion.

Instead of using the add-sum of gain ratios, we can also consider their product
or the add-sum of their squares, which we call the product criterion and
the squares-sum criterion, respectively. However, the product criterion would be
overly pessimistic, avoiding an attribute which has at least one nearly-zero gain
ratio. The squares-sum criterion, on the other hand, would be overly optimistic,
preferring an attribute which relies on a few large gain ratios. The former neglects
an attribute which works well for a subset of dimensions, and the latter
criterion is typically dominated by outliers. Therefore, we use the add-sum criterion
in our approach. Note that this analysis could be justified experimentally
under various settings, but we leave this for future work due to space constraints.
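The contrast between the three criteria can be illustrated with a small sketch. The gain ratios below are made-up numbers, and the function and attribute names are hypothetical; the sketch assumes the gain ratio of each class dimension has already been computed.

```python
from math import prod

def add_sum(gain_ratios):
    """Add-sum criterion: sum of the gain ratios over all class dimensions."""
    return sum(gain_ratios)

def product(gain_ratios):
    """Product criterion: product of the gain ratios."""
    return prod(gain_ratios)

def squares_sum(gain_ratios):
    """Squares-sum criterion: sum of the squared gain ratios."""
    return sum(g * g for g in gain_ratios)

def select_attribute(gain_table, criterion=add_sum):
    """Return the attribute whose gain ratios maximize the given criterion.

    gain_table maps an attribute name to a list of gain ratios,
    one per class dimension (hypothetical input format).
    """
    return max(gain_table, key=lambda a: criterion(gain_table[a]))

# Made-up gain ratios for three attributes and three class dimensions.
gain_table = {
    "balanced": [0.30, 0.30, 0.30],  # moderate for every dimension
    "one_zero": [0.50, 0.50, 0.00],  # good for two dimensions, useless for one
    "one_peak": [0.80, 0.05, 0.05],  # dominated by a single large gain ratio
}

print(select_attribute(gain_table, add_sum))
print(select_attribute(gain_table, product))
print(select_attribute(gain_table, squares_sum))
```

In this toy table the product criterion rejects "one_zero" because of its single zero gain ratio, while the squares-sum criterion is drawn to the single large value in "one_peak".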



3.3 Pruning 

In order to obtain an accurate classifier for all class dimensions, each dimension
should be evaluated independently. Our method employs pre-pruning for each
dimension immediately after constructing a split node, and assigns a predicted
value for a dimension which is pruned.

As explained in Sect. 2, $\chi^2$ pruning tends to produce an overly large decision
tree when the number of examples in the training set is large. Therefore, we use
Cramer’s V [7] to cope with this problem. Cramer’s V $V(a, c_j, E)$ is an index of
relevance of an attribute a and a class dimension $c_j$ for a set E of examples.

$$V(a, c_j, E) = \sqrt{\frac{\chi^2(a, c_j, E)}{|E| \, (\min(v_a, q(c_j)) - 1)}} \tag{7}$$

where $q(c_j)$ is the number of values of $c_j$. This index satisfies $0 \le V(a, c_j, E) \le 1$.

Since this index is, unlike $\chi^2$, simply employed to compare its value and
has no theoretical interpretation, we use the following value $V_r(a, c_j, E_z, \alpha)$ as a
threshold for pruning.

$$V_r(a, c_j, E_z, \alpha) = \sqrt{\frac{\chi^2_r(\alpha)}{|E_z| \, (\min(v_a, q(c_j)) - 1)}} \tag{8}$$

where $|E_z|$ is the expected number of examples assigned to the split node. Let
$|E_p|$ and $|N_{pc}|$ be the number of examples in the parent split node and the
number of child nodes of the parent split node, respectively; then

$$|E_z| = \frac{|E_p|}{|N_{pc}|} \tag{9}$$

For a root node, we define $|E_p|$ to be equal to the number of training
examples in the data set. In our method, a split node is pruned with respect to
a dimension $c_j$ if $V(a, c_j, E) < V_r(a, c_j, E_z, \alpha)$, and decision-tree construction is
continued with the remaining dimensions of the class.
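The pruning test can be sketched as follows. This is a minimal illustration, assuming that Cramer's V takes its standard square-root form and that the contingency table of attribute values against class values is given as a list of rows; all function names are hypothetical. The constant 3.841 is the chi-square critical value for one degree of freedom at the 5% level.

```python
from math import sqrt

def chi_square(table):
    """Pearson chi-square statistic of a contingency table (list of rows of counts)."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            if expected > 0:  # skip empty rows or columns
                stat += (observed - expected) ** 2 / expected
    return stat

def cramers_v(table):
    """Cramer's V of a non-empty table with at least two rows and two columns."""
    n = sum(sum(row) for row in table)
    k = min(len(table), len(table[0])) - 1
    return sqrt(chi_square(table) / (n * k))

def pruning_threshold(chi2_critical, parent_size, parent_children, v_a, q_c):
    """Threshold built from the expected node size |E_z| = |E_p| / |N_pc|."""
    e_z = parent_size / parent_children
    return sqrt(chi2_critical / (e_z * (min(v_a, q_c) - 1)))

CHI2_1_005 = 3.841  # chi-square critical value, 1 degree of freedom, alpha = 5%

def is_pruned(table, parent_size, parent_children, chi2_critical=CHI2_1_005):
    """True if the class dimension described by `table` should be pruned."""
    v_a, q_c = len(table), len(table[0])
    threshold = pruning_threshold(chi2_critical, parent_size, parent_children, v_a, q_c)
    return cramers_v(table) < threshold

perfect = [[10, 0], [0, 10]]    # attribute determines the class dimension
unrelated = [[5, 5], [5, 5]]    # attribute carries no information
print(cramers_v(perfect), is_pruned(perfect, 40, 2))
print(cramers_v(unrelated), is_pruned(unrelated, 40, 2))
```

Because the square root is monotone, comparing the two values with or without it leads to the same pruning decision.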

As explained in Sect. 2, post-pruning produces an accurate tree but is
time-consuming. Our method, unlike a conventional decision-tree inducer,
continues construction of a decision tree with the remaining dimensions even after
pruning. Therefore, we do not employ post-pruning since it requires iterations
of construction and pruning, and is thus time-consuming.



4 D3-B 

4.1 Construction of a Bloomy Decision Tree 

We have implemented our method as D3-B. Its algorithm is shown below, where 
each attribute and each class dimension is assumed to have a discrete value. 
Given a set E of examples, D3-B outputs a bloomy decision tree T using D3- 
B(training set). 

Given a set E of examples, algorithm D3-B recursively constructs a bloomy
decision tree T with a divide and conquer method. If the training set E in a
node can no longer be divided, we add, to T, a flower node which predicts all
dimensions in E. We define that a node can no longer be divided if and only if at
least one of the following conditions holds: 1) there is no class dimension in E,
2) the values of each class dimension are identical, or 3) E is empty. A predicted
value is the majority value in E if E is non-empty, otherwise the majority value
in the training set.

If E can be divided, an attribute a is first selected according to the procedure
described in Sect. 3.2. Next, based on the pruning procedure which will
be described in the next section, a set P of class dimensions to be pruned is
obtained. If P is non-empty, we add, to T, a flower node which predicts the class
dimensions in P, and those class dimensions are deleted from E. From Sect. 3.1,
in this case, there is only one child node, which will be constructed by D3-B(E).
For instance, in Fig. 1, $c_1$ was pruned at the left child of the root node. If P is
empty, child nodes are constructed by D3-B($E_{a=v}$) for each value v of a.

Algorithm: D3-B(E)
Input: data set E
Return value: bloomy decision tree T
begin
  If ((E can no longer be split) and (E has a class dimension))
    Add a flower node which predicts all class dimensions in E to T
  Else
  begin
    a ← argmax_{a'} F(a', E)
    P ← Prune(E, a)
    If (P is non-empty)
    begin
      Add a flower node which predicts class dimensions in P to T
      Delete class dimensions in P from E
      Construct a child node of T by D3-B(E)
    end
    Else
      Foreach (value v of attribute a)
        Construct child nodes of T by D3-B(E_{a=v})
  end
  Return T
end
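The recursion can be sketched in Python as follows. This is an illustrative sketch, not the authors' implementation: the nested-dictionary tree, the callables `select_attribute` and `prune`, and the removal of an attribute once it has been used for a split are all assumptions made here, and child nodes are created only for attribute values that actually occur in the examples.

```python
from collections import Counter

def majority(values):
    """Most frequent value in a non-empty list."""
    return Counter(values).most_common(1)[0][0]

def d3b(examples, attributes, class_dims, select_attribute, prune):
    """Build a bloomy decision tree as nested dicts (hypothetical representation).

    examples: list of dicts holding attribute and class values.
    select_attribute(examples, attributes, class_dims) -> attribute name.
    prune(examples, attribute, class_dims) -> collection of class dimensions to prune.
    """
    if not examples or not class_dims:
        return {"flower": {}}
    pure = all(len({e[c] for e in examples}) == 1 for c in class_dims)
    if pure or not attributes:
        # Node can no longer be divided: predict every remaining class dimension.
        return {"flower": {c: majority([e[c] for e in examples]) for c in class_dims}}
    a = select_attribute(examples, attributes, class_dims)
    to_prune = prune(examples, a, class_dims)
    pruned = [c for c in class_dims if c in to_prune]
    if pruned:
        # Flower node inside the tree: fix the pruned dimensions, continue with the rest.
        node = {"flower": {c: majority([e[c] for e in examples]) for c in pruned}}
        remaining = [c for c in class_dims if c not in pruned]
        if remaining:
            node["child"] = d3b(examples, attributes, remaining, select_attribute, prune)
        return node
    rest = [x for x in attributes if x != a]  # assumption: do not reuse a split attribute
    children = {}
    for v in sorted({e[a] for e in examples}):
        subset = [e for e in examples if e[a] == v]
        children[v] = d3b(subset, rest, class_dims, select_attribute, prune)
    return {"split": a, "children": children}

examples = [
    {"x": "a", "y": "p", "c1": 1, "c2": 0},
    {"x": "a", "y": "q", "c1": 1, "c2": 1},
    {"x": "b", "y": "p", "c1": 0, "c2": 0},
    {"x": "b", "y": "q", "c1": 0, "c2": 1},
]
first_attribute = lambda E, attrs, dims: attrs[0]
prune_c1_on_y = lambda E, a, dims: {"c1"} if a == "y" and "c1" in dims else set()
tree = d3b(examples, ["x", "y"], ["c1", "c2"], first_attribute, prune_c1_on_y)
print(tree)
```

In the toy run, c1 is fixed by a flower node directly below the root split, and only c2 is still split on y underneath it.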

4.2 Pruning of a Bloomy Decision Tree 

Given a set E of examples and a selected attribute a, our algorithm returns a
set P of class dimensions each of which should be predicted in the node. For
each dimension $c_i$ in E, the following procedure checks the pruning condition
explained in Sect. 3.3, and adds the class dimension to P if it satisfies the
condition. This procedure is done in lexical order of class dimensions for simplicity,
and a class dimension $c_i$ which satisfies the condition is not employed in the
calculation of $V(a, c_j, E)$ and $V_r(a, c_j, E_z, \alpha)$ for a subsequent class dimension
$c_j$.

Procedure: Prune(E, a)
Input: data set E, selected attribute a
Return value: a set of class dimensions P
begin
  P ← ∅
  Foreach (class dimension c_i in E)
    If (V(a, c_i, E) < V_r(a, c_i, E_z, α))
    begin
      P ← P ∪ {c_i}
      c_i is not employed in the calculation of V(a, c_j, E) and V_r(a, c_j, E_z, α) for a
      subsequent class dimension c_j
    end
  Return P
end

5 Experimental Evaluation 

5.1 Conditions of Experiments 

We demonstrate the effectiveness of D3-B as a multi-objective classifier by
experiments with eleven data sets. In the rest of this paper, the thresholds for
pruning were set with r = 1 and $\alpha$ = 5%.

“Agriculture” is a series of data sets which describe agricultural statistics for
3,246 municipalities in Japan. In this experiment, we employed the 1991 version.
Each example is represented by 37 attributes such as areas, populations, financial
statistics, and industrial statistics, and has a 25-dimensional class about
gross products of crops. A continuous attribute was first discretized with the
equal-frequency method with five bins. We employed a simple class-blind
discretization method, rather than a class-driven discretization method, for the
sake of simplicity and speed. Before discretization, we visualized distributions of
values for several attributes, and chose the number five arbitrarily. A continuous class
dimension was first discretized with the equal-frequency method with two bins. The
number two was chosen after we observed poor performance of induction algorithms
with five bins. We consider that the information contained in this data set
is insufficient to learn a classifier which correctly predicts a finely defined
multi-dimensional class.

“Kiosk” is a data set which describes inventories about 52 kinds of merchan- 
dise in 232 shops of a Japanese company in 1994. In this data set, each kind 
of merchandise has at least 83 zeros. In discretizing a continuous attribute, we 
treated 0 as a value, and applied equal-frequency method of three bins to the 
other values. The number three was chosen because only a small number of 
examples were used in the discretizing procedure. 

The other nine data sets come from the UCI Repository. We discretized a
continuous attribute with the equal-frequency method with four bins, where a missing
value was left unchanged. The number four was chosen due to the wide variety
of attributes concerning distributions of values: the distribution was relatively
balanced for some attributes such as those in “Agriculture”, and was skewed for
other attributes such as those in “Kiosk”.

For each data set, 100 multi-objective classification tasks were set up. For
each classification task except for those with the agriculture data, we chose six
attributes randomly, and regarded them as a 6-dimensional class. In choosing
these attributes for UCI data sets, we considered their appropriateness as a class
dimension. For each attribute, a single-objective classification task was set up,
and an attribute for which the accuracy of C4.5 is less than 63.5% with 5-fold
cross-validation was ignored in selecting a class dimension. As a result, the
Australian data and the mushroom data have only 28 (= $_8C_6$) possible sets of
6-class tasks, so we checked all these 28 tasks instead of 100 randomly chosen tasks
for these data sets. For the agriculture data, we employed the 25-dimensional
class. Initial experiments revealed that a 3.0% difference in average accuracy and
a 1.5 difference in average number of nodes can each be considered significant.

Note that, in practice, a class dimension should be chosen in terms of its
importance in the domain. An interesting research avenue would be to measure
the effectiveness of learning algorithms by constructing a class attribute with
attributes that can be more naturally used as a class dimension, in the sense that
their prediction is useful for the user.

In order to evaluate the effectiveness of our approach, we compared D3-B
with six learning algorithms including C4.5 and variants of D3-B. In the
experiments, average accuracies for class dimensions and average numbers of split nodes
were measured by 5-fold cross-validation as evaluation indices. First, C4.5 was
chosen as the representative of conventional decision-tree inducers. In applying
C4.5, a multi-dimensional class was transformed to a single-dimensional class by
assigning a new class value to each combination of class values. Second, D3-B
was also applied to data sets each of which was produced with this transformation
in order to evaluate the effectiveness of flower nodes. Third, in order to
evaluate the effectiveness of Cramer’s V pruning, we also employed D3-B with
$\chi^2$ pruning. Fourth and fifth, the add-sum criterion in Sect. 3.2 was evaluated
by using D3-Bs with the product criterion and the squares-sum criterion. Sixth,
we compared D3-B with a method which constructs a decision tree for each class
dimension.
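The transformation used for C4.5 can be sketched as below. The function name and the dictionary-based example format are hypothetical; the sketch only shows how each distinct combination of class values receives one new class value. With m binary class dimensions this can yield up to 2^m values for the single-dimensional class.

```python
def combine_class_dimensions(examples, class_dims):
    """Assign one new class value to each distinct combination of class values.

    Returns the list of new class values (one per example) and the mapping
    from combinations to new values.
    """
    labels = {}
    combined = []
    for e in examples:
        key = tuple(e[c] for c in class_dims)
        if key not in labels:
            labels[key] = len(labels)  # next unused integer label
        combined.append(labels[key])
    return combined, labels

examples = [
    {"c1": 0, "c2": 1},
    {"c1": 0, "c2": 0},
    {"c1": 0, "c2": 1},
]
combined, labels = combine_class_dimensions(examples, ["c1", "c2"])
print(combined, labels)
```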

5.2 Experimental Results 

Figure 2a shows the accuracies and the numbers of split nodes of D3-B and
C4.5. Concerning accuracy, our method outperforms C4.5 in nine data sets by
3.4% to 19.3%, and approximately ties with it in the other two data sets (our
advantage is less than 2.4%). Concerning the number of split nodes, D3-B
constructs smaller trees in all data sets by 9.1 to 452.9. This shows that our D3-B
outperforms C4.5 in accuracy and readability due to its appropriateness to a
multi-objective classification task.

Figure 2b shows the effect of flower nodes on the accuracy and the number
of split nodes. Concerning the accuracy, D3-B outperforms D3-B without flower
nodes in “hepatitis”, “Australian”, and “German” by 7.9% to 11.8%, and
approximately ties with it in the rest (the difference is less than 1.6%). Concerning
the number of split nodes, our method has a smaller number in eight data sets
by 3.5 to 45.5, and approximately ties with the other in the rest (the difference is
less than 1.7). We can conclude that the use of flower nodes almost always improves
readability and occasionally improves accuracy. This result is due to the
fact that a bloomy decision tree, unlike a conventional decision tree, gradually
simplifies a multi-objective classification task with flower nodes inside the tree.

Fig. 2. Experimental results. (a) Accuracies and numbers of split nodes of D3-B and
C4.5. For C4.5, the numbers of nodes for “vehicle”, “German”, and “agriculture” are
109, 152, and 468 respectively. (b) Effect of flower nodes on the accuracies and the
numbers of split nodes. (c) Accuracies and numbers of split nodes of D3-Bs with Cramer’s V
pruning and $\chi^2$ pruning. The numbers of nodes of the latter method for “Australian”,
“credit”, “German”, and “mushroom” are 132, 125, 179, and 103 respectively. (d)
Accuracies and numbers of split nodes of D3-Bs with the add-sum criterion, the product
criterion, and the squares-sum criterion. (e) Comparison of D3-B and a method which
constructs a decision tree for each class dimension. The numbers of nodes of the latter
method for “housing”, “Australian”, “credit”, and “vehicle” are 101, 144, 143, and 119
respectively. Note that this comparison should be treated differently since the latter
method constructs multiple trees.

Figure 2c shows the influence of pruning methods on the accuracy and the
number of split nodes. Since $\chi^2$ pruning is known to produce an overly large
decision tree for a large data set, data sets on the horizontal axis are sorted
in ascending order with respect to their numbers of examples. Concerning the
accuracy, our method outperforms $\chi^2$ pruning in five data sets by 3.7% to 12.4%,
and approximately ties with it in the remaining six (the difference is less than
1.1%). Concerning the readability, our method outperforms $\chi^2$ pruning in eight
data sets (7.2 to 152.8 smaller), and approximately ties with it in the remaining three
(0.6 to 2.9 larger). These three data sets correspond to the second, the third, and
the fourth smallest data sets. These facts show that our Cramer’s V pruning
remedies the ineffectiveness of $\chi^2$ pruning in readability when the training set is
large.

Figure 2d shows the effect of criteria on the accuracy and the number of split
nodes. Concerning the readability, the product criterion produces trees with the
smallest number of nodes in seven data sets by 7.5 to 24.4, which seems significant,
and also trees with the smallest number of nodes in three data sets although the
difference is smaller (less than 1.6). “Agriculture” is the only exception since the
squares-sum criterion produces trees with a smaller number of nodes by 4.4. We
attribute these results to the fact that the product criterion selects an attribute
which splits all class dimensions well. We could not judge superiority between
the other two criteria from these experiments. Concerning the accuracy, the add-sum
criterion outperforms the other criteria in all data sets. In particular, the difference
seems significant in seven data sets (4.9% to 13.7%). We attribute these results to
the fact that our criterion is, as we discussed in Sect. 3.2, robust concerning the
distribution of gain ratios of class dimensions, and is thus adequate for prediction
in multi-objective classification.

Figure 2e shows a comparison of D3-B and a method which constructs a decision
tree for each class dimension. We see that the differences in accuracy do not
seem significant since they are less than 2.7% in all data sets. For readability,
however, D3-B always constructs a smaller tree (the difference is at least 6.6,
which seems significant). We consider that these results are due to the fact that
D3-B constructs a single tree while the other method constructs multiple trees.
Moreover, as we mentioned earlier, readability is much worse than it appears
in the latter method since analysis based on multiple trees is difficult.

We also investigated these methods by varying the number m of class dimensions
from two to five. Our method typically has no clear advantage for
m = 2, but gradually outperforms other methods as m increases. It should be
noted that no clear difference was observed for relatively “easy” data sets such
as “housing”. We did not try m > 7, in order to avoid the effect that the number of
attributes becomes smaller as m increases.

From these results, the superiority of our method in accuracy and/or readability
over other methods has been empirically demonstrated. We consider that this
superiority demonstrates that our proposals of the flower node, the add-sum
criterion, and the Cramer’s V pruning are effective in multi-objective classification.

6 Conclusion 

In this paper, we proposed a learning algorithm for a novel decision tree from 
a multi-objective data set. Conventional learning algorithms are ineffective in 
constructing an accurate and readable decision tree in multi-objective classifica- 
tion. Our D3-B constructs a single classifier: a bloomy decision tree for such a 
data set. In a bloomy decision tree, the number of class dimensions gradually de- 
creases by the use of flower nodes inside the tree. Experiments with eleven data 
sets showed that our D3-B, compared with C4.5 and other methods, typically 
constructs more accurate and/or smaller trees.

Abstract. Since hospital information systems have been introduced in
large hospitals, a large amount of data, including laboratory examinations,
has been stored as temporal databases. The characteristics of
these temporal databases are: (1) Each record is inhomogeneous with
respect to time-series, including short-term effects and long-term effects.
(2) Each record has more than 1000 attributes when a patient is followed
for more than one year. (3) When a patient is admitted for a long time,
a large amount of data is stored in a very short term. Since even medical
experts cannot deal with these large databases, the interest in mining
useful information from the data is growing. In this paper, we
introduce a combination of an extended moving average method, multiscale
matching and a rule induction method to discover new knowledge in medical
temporal databases. This method was applied to a medical dataset,
the results of which show that interesting knowledge is discovered from
each database.



1 Introduction 

Since hospital information systems have been introduced in large hospitals, a
large amount of data, including laboratory examinations, has been stored as
temporal databases. For example, in a university hospital, where more than
1000 patients visit from Monday to Friday, a database system stores more than
1 GB of numerical data of laboratory examinations. Thus, it is highly expected
that data mining methods will find interesting patterns in these databases because
medical experts cannot deal with such large amounts of data. The characteristics
of these temporal databases are: (1) Each record is inhomogeneous with respect
to time-series, including short-term effects and long-term effects. (2) Each record
has more than 1000 attributes when a patient is followed for more than one year.
(3) When a patient is admitted for a long time, a large amount of data is stored in
a very short term. Since even medical experts cannot deal with these large temporal
databases, the interest in mining useful information from the data is
growing.

In this paper, we introduce a rule discovery method, combined with an extended
moving average method and multiscale matching for qualitative trend, to discover
new knowledge in medical temporal databases. In this system, the extended moving
average method and multiscale matching are used for preprocessing, to
deal with the irregularity of each temporal data. Using several parameters for
time-scaling, given by users, this moving average method generates a new database
for each time scale with summarized attributes. For matching time sequences,
multiscale matching was applied. Then, a rule induction method is applied to each
new database with summarized attributes. This method was applied to two medical
datasets, the results of which show that interesting knowledge is discovered
from each database.

This paper is organized as follows. Section 2 introduces the definition of
probabilistic rules. Section 3 discusses the characteristics of temporal databases
in hospital information systems. Section 4 presents the extended moving average
method. Section 5 introduces the second preprocessing method to extract qualitative
trend and the rule discovery method with qualitative trend. Section 6 shows
experimental results. Section 7 gives a brief discussion of the total method.
Finally, Section 8 concludes this paper.

2 Probabilistic Rules and Conditional Probabilities 

Before discussing temporal knowledge discovery, we first discuss the characteristics
of probabilistic rules. In this section, we use the following notations
introduced by Grzymala-Busse and Skowron, which are based on rough set
theory.

Let U denote a nonempty, finite set called the universe and A denote a
nonempty, finite set of attributes, i.e., $a : U \to V_a$ for $a \in A$, where $V_a$ is called
the domain of a. Then, a decision table is defined as an information
system, $\mathcal{A} = (U, A \cup \{d\})$. The atomic formulae over $B \subseteq A \cup \{d\}$ and V are
expressions of the form [a = v], called descriptors over B, where $a \in B$ and
$v \in V_a$. The set F(B, V) of formulas over B is the least set containing all atomic
formulas over B and closed with respect to disjunction, conjunction and negation.
For each $f \in F(B, V)$, $f_A$ denotes the meaning of f in A, i.e., the set of all
objects in U with property f, defined inductively as follows.

1. If f is of the form [a = v], then $f_A = \{s \in U \mid a(s) = v\}$.

2. $(f \wedge g)_A = f_A \cap g_A$; $(f \vee g)_A = f_A \cup g_A$; $(\neg f)_A = U - f_A$.

By the use of the framework above, classification accuracy and coverage, or true
positive rate, are defined as follows.

Let R and D denote a formula in F(B, V) and a set of objects which belong to
a decision d. Classification accuracy and coverage (true positive rate) for $R \to d$
are defined as:

$$\alpha_R(D) = \frac{|R_A \cap D|}{|R_A|} \; (= P(D|R)), \quad \text{and} \quad \kappa_R(D) = \frac{|R_A \cap D|}{|D|} \; (= P(R|D)),$$

where $|S|$, $\alpha_R(D)$, $\kappa_R(D)$ and $P(S)$ denote the cardinality
of a set S, a classification accuracy of R as to classification of D, coverage
(a true positive rate of R to D), and probability of S, respectively.

By the use of accuracy and coverage, a probabilistic rule is defined as:

$$R \xrightarrow{\alpha, \kappa} d \quad \text{s.t.} \quad R = \wedge_j [a_j = v_k], \quad \alpha_R(D) \, (= P(D|R)) \ge \delta_\alpha \quad \text{and} \quad \kappa_R(D) \, (= P(R|D)) \ge \delta_\kappa.$$

For further information about these probabilistic rules, the reader may refer to
the cited references.
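Accuracy and coverage can be computed directly from a decision table. The sketch below is illustrative only: the table, the attribute names and the thresholds are made up, a rule is represented as a dictionary of attribute-value pairs (a conjunction of descriptors), and the thresholds are assumed to be compared with "greater than or equal to".

```python
def accuracy_and_coverage(rows, rule, decision_attr, decision_value):
    """Return (accuracy, coverage) of the rule `rule -> decision_value`.

    rows: list of dicts (one per object in the universe).
    rule: dict of attribute -> value, read as a conjunction of descriptors.
    """
    matched = {i for i, row in enumerate(rows)
               if all(row.get(a) == v for a, v in rule.items())}
    decided = {i for i, row in enumerate(rows)
               if row.get(decision_attr) == decision_value}
    both = len(matched & decided)
    accuracy = both / len(matched) if matched else 0.0   # P(D | R)
    coverage = both / len(decided) if decided else 0.0   # P(R | D)
    return accuracy, coverage

def is_probabilistic_rule(accuracy, coverage, delta_alpha, delta_kappa):
    """Check both thresholds of the probabilistic rule definition."""
    return accuracy >= delta_alpha and coverage >= delta_kappa

# Made-up decision table with decision attribute "d".
rows = [
    {"fever": "yes", "cough": "yes", "d": "flu"},
    {"fever": "yes", "cough": "no", "d": "flu"},
    {"fever": "yes", "cough": "yes", "d": "cold"},
    {"fever": "no", "cough": "yes", "d": "cold"},
    {"fever": "no", "cough": "no", "d": "flu"},
]
acc, cov = accuracy_and_coverage(rows, {"fever": "yes"}, "d", "flu")
print(acc, cov, is_probabilistic_rule(acc, cov, 0.6, 0.5))
```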

3 Temporal Databases in Hospital Information Systems 

Since incorporating temporal aspects into databases is still an ongoing research
issue in the database area, temporal data are stored as a table in hospital
information systems (H.I.S.). Table 1 shows a typical example of medical data, which
is retrieved from H.I.S. The first column denotes the ID number of each patient,
and the second one denotes the date on which the data in this row were examined.
Each row with the same ID number describes the results of laboratory
examinations, which were taken on the date in the second column. For example,
the second row shows the data of the patient ID 1 on 04/19/1986. This simple
database shows the following characteristics of medical temporal databases:

(1) The Number of Attributes Is Too Large. Even though the dataset
of a patient focuses on the transition of each examination (attribute), it would
be difficult to see its trend when the patient is followed for a long time. If one
wants to see the long-term interaction between attributes, it would be almost
impossible. In order to solve this problem, most H.I.S. systems provide several
graphical interfaces to capture temporal trends. However, the interactions
among more than three attributes are difficult to study even if visualization
interfaces are used.

(2) Irregularity of Temporal Intervals. Temporal intervals are irregular.
Although most of the patients will come to the hospital every two weeks or one
month, physicians may not make laboratory tests each time. When a patient
has an acute fit or suffers from acute diseases, such as pneumonia, laboratory
examinations will be made every one to three days. On the other hand, when
his/her status is stable, these tests may not be made for a long time. Patient
ID 1 is a typical example. Between 04/30 and 05/08/1986, he suffered from
pneumonia and was admitted to a hospital. Then, during the therapeutic
procedure, laboratory tests were made every few days. On the other hand,
when he was stable, such tests were ordered every one or two years.

(3) Missing Values. In addition to irregularity of temporal intervals, datasets
have many missing values. Even though medical experts will make laboratory
examinations, they may not take the same tests at each instant. Patient ID
1 in Table 1 is a typical example. On 05/06/1986, the physician selected
a specific test to confirm his diagnosis, so he did not choose other tests. On
01/09/1989, he focused only on GOT, not other tests. In this way, missing values
will be observed very often in clinical situations.

Table 1. An example of temporal database

ID  Date      GOT  GPT  LDH  γ-GTP  TP   edema  ...
1   19860419  24   12   152  63     7.5  -      ...
1   19860430  25   12   162  76     7.9  +      ...
1   19860502  22   8    144  68     7.0  +      ...
1   19860506                                    ...
1   19860508  22   13   156  66     7.6  -      ...
1   19880826  23   17   142  89     7.7  -      ...
1   19890109  32                         -      ...
1   19910304  20   15   369  139    6.9  +      ...
2   19810511  20   15   369  139    6.9  -      ...
2   19810713  22   14   177  49     7.9  -      ...
2   19880826  23   17   142  89     7.7  -      ...
2   19890109  32                         -      ...

These characteristics have already been discussed in the KDD area. However,
in real-world domains, especially domains in which follow-up studies are crucial,
such as medical domains, these ill-posed situations are especially pronounced. If one
wants to describe each patient (record) as one row, then each row has too
many attributes, the number of which depends on how many times laboratory
examinations are made for each patient. It is notable that although the above
discussions are made according to the medical situations, similar situations may
occur in other domains with long-term follow-up studies.

4 Extended Moving Average Methods 

4.1 Moving Average Methods 

Moving average methods have been introduced in statistical analysis. Temporal
data often suffer from noise, which will be observed as a spike or sharp wave
during a very short period, typically at one instant. Moving average methods
remove such an incidental effect and make temporal sequences smoother.

With one parameter w, called window, the moving average $\hat{y}_w$ is defined as
follows:

$$\hat{y}_w = \frac{1}{w} \sum_{j=1}^{w} y_j$$

For example, in the case of GOT of patient ID 1, $\hat{y}_5$ is calculated as:
$\hat{y}_5 = (24 + 25 + 22 + 22 + 22)/5 = 23.0$. It is easy to see that $\hat{y}_w$ will remove
the noise effect which continues for fewer than w points.
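The definition above can be sketched in a few lines of Python. The function name is hypothetical, and the sliding form (one mean per run of w consecutive values) is an assumption made here; its first entry is exactly the average of the first w values given by the formula.

```python
def moving_average(values, w):
    """Means of every run of w consecutive values.

    The first entry equals (y_1 + ... + y_w) / w.
    """
    if w <= 0 or w > len(values):
        raise ValueError("w must be between 1 and len(values)")
    return [sum(values[i:i + w]) / w for i in range(len(values) - w + 1)]

# The GOT example from the text.
print(moving_average([24, 25, 22, 22, 22], 5))

# A one-point spike is flattened once the window is wider than the spike.
print(moving_average([10, 10, 40, 10, 10], 3))
```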

The advantage of the moving average method is that it makes it possible to remove
the noise effect when inputs are given periodically. For example, when some
tests are measured every several days¹, the moving average method is useful
to remove the noise and to extract periodical domains. However, in real-world
domains, inputs are not always periodical, as shown in Table 1. Thus, when the
applied time-series are irregular or discrete, ordinary moving average methods are
powerless. Another disadvantage of this method is that it is not applicable
to categorical attributes. In the case of numerical attributes, the average can be used
as a summarized statistic. On the other hand, such an average cannot be defined
for categorical attributes.

¹ This condition guarantees that measurement is approximately continuous.

Thus, we introduce the extended averaging method to solve these two prob- 
lems in the subsequent subsections. 

4.2 Extended Moving Average for Continuous Attributes 

In this extension, we first focus on how moving average methods remove noise.
The key idea is that a window parameter w is closely related to periodicity.
If w is larger, then the periodical behavior whose time-constant is lower than
w will be removed. Usually, a spike caused by noise is observed as a single event and
this effect will be removed when w is taken as a large value. Thus, the choice
of w separates different kinds of time-constant behavior in each attribute and
in the extreme case when w is equal to the total number of temporal events, all the
temporal behavior will be removed. We refer to this extreme case as w = ∞.

The extended moving average method is executed as follows: first, it calculates
$\hat{y}_\infty$ for an attribute y. Second, the method outputs its maximum and minimum
values. Then, according to the selected values for w, a set of sequences $\{\hat{y}_w(i)\}$
for each record is calculated. For example, if {w} is equal to {10 years, 5 years,
1 year, 3 months, 2 weeks}, then for each element in {w}, the method uses
the time-stamp attribute for calculation of each $\{\hat{y}_w(i)\}$ in order to deal with
irregularities.

In the case of Table 1, when w is taken as 1 year, all the rows are aggregated
into several components as shown in Table 2. From this aggregation, a sequence
$\hat{y}_w$ for each attribute is calculated as in Table 3.



Table 2. Aggregation for w = 1 (year)

ID  Date      GOT  GPT  LDH  γ-GTP  TP   edema  ...
1   19860419  24   12   152  63     7.5  -      ...
1   19860430  25   12   162  76     7.9  +      ...
1   19860502  22   8    144  68     7.0  +      ...
1   19860506                                    ...
1   19860508  22   13   156  66     7.6  -      ...
1   19880826  23   17   142  89     7.7  -      ...
1   19890109  32                         -      ...
1   19910304  20   15   369  139    6.9  +      ...

Table 3. Moving average for w = 1 (year)

ID  Period  GOT    GPT    LDH    γ-GTP  TP    edema
1   1       23.25  11.25  153.5  68.25  7.5   ?
1   2       23     17     142    89     7.7   ?
1   3       32                                ?
1   4                                         ?
1   5       20     15     369    139    6.9   ?
1   ∞       24     12.83  187.5  83.5   7.43  ?


The selection of w can be automated. The simplest way to calculate w is to
use the powers of a natural number, such as 2. For example, we can use $2^n$ as the
window length: 2, 4, 8, 16, .... Using this scale, two weeks, three months, and one
year correspond to 16 = $2^4$, 64 = $2^6$, and 256 = $2^8$, respectively.



4.3 Categorical Attributes 

One of the disadvantages of the moving average method is that it cannot deal
with categorical attributes. To solve this problem, we classify categorical
attributes into three types, the type of each attribute being given by users. The
first type is constant, which does not change during the follow-up period. The
second type is ranking, which is used to rank the status of a patient. The third
type is variable, which changes temporally, but for which ranking is not useful. For the
first type, the extended moving average method is not applied. For the second
one, an integer is assigned to each rank and the extended moving average method
for continuous attributes is applied. For the third one, the
temporal behavior of the attribute is transformed into statistics as follows.

First, the occurrence of each category (value) is counted for each window.
For example, in Table 2, edema is a binary attribute of the variable type. In the first
window, the attribute edema takes the values {-, +, +, -}.^2 So the occurrences of - and + are
2 and 2, respectively. Then, each conditional probability is calculated. In
the above example, the probabilities are $p(-|w_1) = 2/4$ and $p(+|w_1) = 2/4$.
Finally, for each probability, a new attribute is appended to the table (Table 4).
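The counting step for a variable-type attribute can be sketched as below. The function name `category_probabilities` is hypothetical, and the input list is the edema column of the first window in Table 2, with `None` for the row whose value is missing.

```python
from collections import Counter


def category_probabilities(values):
    """Relative frequency of each category in one window, ignoring missing values."""
    present = [v for v in values if v is not None]
    counts = Counter(present)
    total = len(present)
    return {category: n / total for category, n in counts.items()} if total else {}


# edema in the first window of Table 2 (the 19860506 row has no value).
first_window = ["-", "+", "+", None, "-"]
probs = category_probabilities(first_window)
```

Each entry of `probs` would become one new column of the output table, such as edema(+) and edema(-) in Table 4.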



Summary of Extended Moving Average. The whole process of the extended mov-
ing average is used to construct a new table for each window parameter as the
first preprocessing. Then, the second preprocessing method is applied to the newly
generated tables. The first preprocessing method is summarized in
Fig. 1.



^2 Missing values are ignored for counting.
Table 4. Final table with moving average for w = 1 (year)

ID  Period  GOT    GPT    LDH    γ-GTP  TP    edema(+)  edema(-)
1   1       23.25  11.25  153.5  68.25  7.5   0.5       0.5
1   2       23     17     142    89     7.7   0.0       1.0
1   3       32                                0.0       1.0
1   4                                         0.0       1.0
1   5       20     15     369    139    6.9   1.0       0.0
1   ∞       24     12.83  187.5  83.5   7.43  0.43      0.57



1. Repeat for each w in List $L_w$:
   a) Select an attribute from List $L_a$;
      i.   If the attribute is numerical, then calculate the moving average for w;
      ii.  If the attribute is constant, then break;
      iii. If the attribute is rank, then assign an integer to each rank and
           calculate the moving average for w;
      iv.  If the attribute is variable, then calculate accuracy and coverage of
           each category;
   b) If $L_a$ is not empty, go to (a).
   c) Construct a new table with each moving average.
2. Construct a table for w = ∞.



Fig. 1. First preprocessing method 



5 Second Preprocessing and Rule Discovery 

5.1 Summarizing Temporal Sequences 

To the data table obtained by the extended moving average method, several
preprocessing methods may be applied so that users can detect the temporal
trends in each attribute. One way is discretization of time series by clustering,
introduced by Das [?]. This method transforms time series into symbols repre-
senting qualitative trends by using a similarity measure, so that time-series data
are represented as a symbolic sequence. After this preprocessing, a rule discovery
method is applied to the sequential data. Another way is to find auto-regression
equations from the sequence of averaged means. These quantitative equa-
tions can be used directly to extract knowledge, or their qualitative interpretation
may be used, and rule discovery [3], other machine learning methods [2], or the rough
set method [12] can be applied to extract qualitative knowledge.

In this research, we adopt two modes and transform databases into two
forms. One mode applies the temporal abstraction method [?] with multiscale
matching [?] as second preprocessing and transforms all continuous attributes
into temporal sequences. The other mode applies rule discovery to the data
after the first preprocessing, without the second one. The reason why we adopt
these two modes is that we focus not only on the temporal behavior of each attribute,
but also on associations among several attributes. Although Miksch's method [?]
and Das's approach [?] are very efficient at extracting knowledge about transitions,
they cannot focus on associations between attributes in an efficient way. For the
latter purpose, much simpler rule discovery algorithms are preferred.



5.2 Continuous Attributes and Qualitative Trend 

To characterize the deviation and temporal change of continuous attributes, we
introduce standardization of continuous attributes. For this, we need only the
total average $y_\infty$ and its standard deviation $\sigma_\infty$. With these parameters, the standard-
ized value is obtained as:

$$ z_w = \frac{y_w - y_\infty}{\sigma_\infty}. $$

Standardization is introduced because it makes comparison be-
tween continuous attributes much easier and clearer; in particular, statistical theory
guarantees that the coefficients of an auto-regression equation can be compared
with those of another equation [?].

After calculating the standardized values, an extraction algorithm for qual-
itative trends [?] is applied with multiscale matching, briefly described in the next
subsection.
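The standardization step can be sketched as follows. This is an illustration under stated assumptions: the function name `standardize` is hypothetical, and the population standard deviation (`statistics.pstdev`) is assumed because the text does not say whether the population or the sample form is meant.

```python
from statistics import mean, pstdev


def standardize(window_means, all_values):
    """Return z_w = (y_w - y_inf) / sigma_inf for each window mean.

    all_values are the raw observations used for the w = infinity table;
    a window without data (None) stays None.
    """
    y_inf = mean(all_values)
    sigma_inf = pstdev(all_values)  # assumption: population standard deviation
    if sigma_inf == 0:
        raise ValueError("attribute is constant; standardization is undefined")
    return [None if y is None else (y - y_inf) / sigma_inf for y in window_means]


# GOT observations of patient 1 (Table 2) and the yearly means of Table 3.
got_values = [24, 25, 22, 22, 23, 32, 20]
got_window_means = [23.25, 23, 32, None, 20]
z = standardize(got_window_means, got_values)
```

A positive entry of `z` marks a window whose mean lies above the overall average of the attribute, a negative entry one below it.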

This method proceeds as follows. First, it smooths the data
with window parameters. Second, the smoothed values of each attribute are
classified into seven categories given as domain knowledge about laboratory test
values: extremely low, substantially low, slightly low, normal range, slightly high,
substantially high, and extremely high. With these categories, qualitative trends
are calculated and classified into the following ten categories by using guideline
rules: decrease too fast (A1), normal decrease (A2), decrease too slow (A3), zero
change (ZA), dangerous increase (C), increase too fast (B1), normal increase (B2),
increase too slow (B3), dangerous decrease (D). For matching temporal sequences
with guideline rules, the multiscale matching method is applied. For example, if the
value of some laboratory test changes from substantially high to normal range
within a very short time, the qualitative trend is classified as A1 (decrease
too fast). For further information, please refer to [?].



5.3 Multiscale Matching 

Multiscale matching is based on two basic ideas: the first is to use the
curvature of the curve to detect its points of inflection; the second is to
use a scale factor to calculate the curvature of the smoothed curve [?]. The
curvature is given as:

$$ c(t) = \frac{y''}{(1 + (y')^2)^{3/2}}, $$

where $y' = dy/dt$ and $y'' = d^2y/dt^2$. In order to compute the curvature of the
curve at varying levels of detail, the function y is convolved with a one-dimensional
Gaussian kernel $g(t, \sigma)$ of width (scaling factor) $\sigma$:

$$ g(t, \sigma) = \frac{1}{\sigma\sqrt{2\pi}} e^{-t^2 / 2\sigma^2}. $$

$Y(t, \sigma)$, the convolution of $y(t)$, is defined as:

$$ Y(t, \sigma) = y(t) \otimes g(t, \sigma) = \int_{-\infty}^{\infty} y(u) \frac{1}{\sigma\sqrt{2\pi}} e^{-(t-u)^2 / 2\sigma^2} \, du. $$

According to the characteristics of the convolution, the first and second
derivatives are calculated as:

$$ Y'(t, \sigma) = y(t) \otimes g'(t, \sigma), \qquad Y''(t, \sigma) = y(t) \otimes g''(t, \sigma). $$

Using $Y(t, \sigma)$, $Y'(t, \sigma)$ and $Y''(t, \sigma)$, we can calculate the curvature of a given
curve for each value of $\sigma$ within one window w:

$$ c(t, \sigma) = \frac{Y''(t, \sigma)}{(1 + Y'(t, \sigma)^2)^{3/2}}. $$

This gives a sequence of curvature values for each time series. If two time
series are similar with respect to temporal change, their two sequences of
curvature will be similar. Furthermore, since we calculate the curvature for each
scaling factor, we can compare these sequences from the local level to the
global level. For further information, please refer to [?] and [?].
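A numerical sketch of the scale-dependent curvature is given below. It is not the authors' code: it assumes an evenly sampled series with step `dt`, truncates the Gaussian kernel at five standard deviations, replaces the integrals by Riemann sums, and leaves the margins where the kernel does not fit as `None`. The names `gaussian_derivative_kernels` and `curvature` are hypothetical.

```python
import math


def gaussian_derivative_kernels(sigma, dt):
    """Sampled first and second derivatives of a Gaussian, truncated at 5 sigma."""
    radius = int(math.ceil(5 * sigma / dt))
    us = [k * dt for k in range(-radius, radius + 1)]
    norm = sigma * math.sqrt(2 * math.pi)
    g = [math.exp(-u * u / (2 * sigma ** 2)) / norm for u in us]
    g1 = [-u / sigma ** 2 * gv for u, gv in zip(us, g)]
    g2 = [(u * u / sigma ** 4 - 1 / sigma ** 2) * gv for u, gv in zip(us, g)]
    return radius, g1, g2


def curvature(y, sigma, dt=1.0):
    """Curvature of the series y smoothed at scale sigma; None near the edges."""
    radius, g1, g2 = gaussian_derivative_kernels(sigma, dt)
    result = [None] * len(y)
    for i in range(radius, len(y) - radius):
        # Discrete convolution: sum over k of y[i - k] * kernel(k * dt) * dt.
        d1 = sum(y[i - k] * g1[k + radius] for k in range(-radius, radius + 1)) * dt
        d2 = sum(y[i - k] * g2[k + radius] for k in range(-radius, radius + 1)) * dt
        result[i] = d2 / (1 + d1 * d1) ** 1.5
    return result


# A sine wave sampled every 0.05 time units and a straight line for comparison.
dt = 0.05
wave = [math.sin(k * dt) for k in range(300)]
line = [2.0 * k * dt for k in range(300)]
c_wave = curvature(wave, sigma=0.5, dt=dt)
c_line = curvature(line, sigma=0.5, dt=dt)
```

Calling `curvature` with several values of `sigma` on the same series gives one curvature sequence per scale, which is the representation that the matching step compares from the local to the global level.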

5.4 Rule Discovery Algorithm 

For rule discovery, a simple rule induction algorithm discussed in [?] is applied,
where continuous attributes are transformed into categorical attributes with a
cut-off point. As discussed in Sect. 3, the moving average method removes
temporal effects shorter than the window parameter. Thus, w = ∞ removes all
temporal effects, so this moving average can be viewed as data without any
temporal characteristics. If rule discovery is applied to these data, it generates
rules which represent non-temporal associations between attributes. In this way,
data processed by the w-moving average are used to discover associations with a time effect of w
or longer. Ideally, from w = ∞ down to w = 1, we decompose all
the independent time-effect associations between attributes. However, the time
constants in which users are interested are limited, and the moving average
method shown in Sect. 3 uses a set of w given by users. Thus, application of rule
discovery to each table generates a sequence of temporal associations between
attributes. If some temporal associations differ from the associations obtained with
w = ∞, then these specific relations may correspond to a new discovery.







5.5 Summary of Second Preprocessing and Rule Discovery 

Second preprocessing method and rule discovery are summarized as shown in 
Fig. 2. 



1. Calculate $y_\infty$ and $\sigma_\infty$ from the table of w = ∞;
2. Repeat for each w in List $L_w$
   (w is sorted in descending order):
   a) Select the table of w: $T_w$;
      i.   Standardize continuous and ranking attributes;
      ii.  Calculate qualitative trends for continuous and ranking attributes with
           multiscale matching;
      iii. Construct a new table for qualitative trends;
      iv.  Apply the rule discovery method for temporal sequences;
   b) Apply rule induction methods to the original table $T_w$;



Fig. 2. Second preprocessing and rule discovery 



6 Experimental Results: Discovered Results in CVD 

The above rule discovery system was applied to a clinical database on cere-
brovascular diseases (CVD), which has 2610 records, described by 12 classes.
Each record was followed up for at least 10 years, and the average number of attributes
is 2715. The list of w, {w}, was set to {10 years, 5 years, 1 year, 3 months, 2
weeks}, and the thresholds $\delta_\alpha$ and $\delta_\kappa$ were set to 0.60 and 0.30, respectively. One of
the most important problems in CVD is whether CVD patients will suffer from
mental disorders or dementia, and how long it takes each patient to reach the
status of dementia.

6.1 Non-temporal Knowledge 

Concerning the database on CVD, several interesting rules are derived. The most
interesting results are the following positive and negative rules for thalamus
hemorrhage:

[Sex = Female] ∧ [Hemiparesis = Left] ∧ [LOC : positive] → Thalamus
(accuracy: 0.62, coverage: 0.33),

[Risk : Hypertension] ∧ [Sensory = no] → Putamen
(accuracy: 0.66, coverage: 0.43).

Interestingly, LOC (loss of consciousness) under the condition of [Sex =
Female] ∧ [Hemiparesis = Left] is an important factor for diagnosing thalamic
damage. In this domain, no strong correlations between these attributes and
others, like MND, have been found yet. It will be our future work to find
which factors lie behind these rules. However, these rules do not include the
relations between dementia and brain functions.
Short-Term Effect. As short-term rules, the following interesting rules are
discovered:

[Gastro : A1] ∧ [Quadriceps : A1] → [Dementia : A2]
(accuracy: 0.71, coverage: 0.31, w = 3 (months)),

[Gastro : D] ∧ [TA : D] → [Dementia : A2]
(accuracy: 0.74, coverage: 0.32, w = 3 (months)).

These rules suggest that a rapid decrease of muscle power in the lower ex-
tremities is weakly related to the appearance of dementia. However, this
knowledge has not been reported before, and further investigation is required for its in-
terpretation.

Long-Term Effect. As long-term rules, the following interesting rules are dis-
covered:

[Joint Position : A3] ∧ [Quadriceps : A3] → [Dementia : A3]
(accuracy: 0.61, coverage: 0.35, w = 1 (year)),

[Gastro : A3] ∧ [Vibration : A3] → [Dementia : A3]
(accuracy: 0.87, coverage: 0.33, w = 1 (year)).

These rules suggest that the combination of a decrease of muscle power in the
lower extremities and an increase of sensory disturbance is weakly related
to the appearance of dementia. However, this knowledge has not been
reported either, and further investigation is required for its interpretation.

7 Discussion 

This paper introduces a combination of the extended moving average method as first
preprocessing, extraction of qualitative trends as second preprocessing, and rule
discovery. As discussed in Sects. 3 and 4, this approach is inspired by rule discov-
ery in time series introduced by Das [?]. The main differences between
Das's approach and ours are the following.

1. For smoothing data, the extended moving average method is introduced.

2. The system incorporates domain knowledge about a continuous attribute to
detect its qualitative trend.

3. Qualitative trends are calculated for each time constant.

4. Rules are discovered with respect to not only associations between qualitative
trends but also non-temporal associations.

Using these methods, the system discovered several interesting patterns with
different time constants in a clinical database.

The disadvantage of this approach is that the program is not good at extract-
ing periodic behavior of disease processes, or the recurrence of some diseases,
because the qualitative trends do not support the detection of cycles in the temporal
behavior of attributes. For these periodic processes, auto-regressive function
analysis is much more useful [?]. It will be our future work to extend our approach
so that it can deal with periodicity more clearly.
8 Conclusion

In this paper, we introduce a combination of the extended moving average method,
multiscale matching, and a rule induction method to discover new knowledge in
temporal databases. In the system, the extended moving average method is used
for preprocessing, to deal with the irregularity of temporal data. Using sev-
eral parameters for time-scaling with multiscale matching, given by users, this
moving average method generates a new database for each time scale with sum-
marized attributes. Then, the rule induction method is applied to each new database
with summarized attributes. This method was applied to two medical datasets,
the results of which show that interesting knowledge is discovered from each
database.
Abstract. One of the most important problems with rule induction meth-
ods is that extracted rules only partially represent information on experts'
decision processes, which makes rule interpretation by domain experts
difficult. In order to solve this problem, the characteristics of medical
reasoning are discussed, and positive and negative rules are introduced, which
model medical experts' rules. Then, for induction of positive and negative
rules, two search algorithms are provided. The proposed rule induction
method was evaluated on medical databases, the experimental results of
which show that the induced rules correctly represented experts' knowledge
and that several interesting patterns were discovered.



1 Introduction 



Rule induction methods are classified into two categories, induction of deter-
ministic rules and of probabilistic ones [?]. On the one hand, deterministic rules
are described as if-then rules, which can be viewed as propositions. From the
set-theoretical point of view, the set of examples supporting the conditional part
of a deterministic rule, denoted by C, is a subset of the set whose examples belong
to the consequence part, denoted by D. That is, the relation $C \subseteq D$ holds, and
deterministic rules are supported only by positive examples in a dataset. On the
other hand, probabilistic rules are if-then rules with probabilistic information
[?]. From the set-theoretical point of view, C is not a subset of D, but closely over-
laps with it. That is, the relations $C \cap D \neq \emptyset$ and $|C \cap D|/|C| > \delta$ hold in
this case.^1 Thus, probabilistic rules are supported by a large number of positive
examples and a small number of negative examples. The common feature of both
deterministic and probabilistic rules is that they deduce their consequence
positively if an example satisfies their conditional parts. We call the reasoning
by these rules positive reasoning.

However, medical experts use not only positive reasoning but also nega-
tive reasoning for the selection of candidates, which is represented as if-then rules
whose consequences include negative terms. For example, when a patient who
complains of headache does not have a throbbing pain, migraine should not be
suspected with a high probability. Thus, negative reasoning also plays an im-
portant role in cutting the search space of a differential diagnosis process.
Medical reasoning therefore includes both positive and negative reasoning, though
conventional rule induction methods do not reflect this aspect. This is one of
the reasons why medical experts have difficulties in interpreting induced rules,
and why the interpretation of rules for a discovery procedure does not easily proceed.
Therefore, negative rules should be induced from databases, not only to
obtain rules reflecting experts' decision processes, but also to obtain rules which
are easier for domain experts to interpret. Both are important for
enhancing the discovery process carried out through the cooperation of medical experts and
computers.

^1 The threshold $\delta$ is the degree of closeness of the overlapping sets, which is given
by domain experts. For more information, please refer to Sect. 3.

In this paper, first, the characteristics of medical reasoning are examined,
and two kinds of rules, positive rules and negative rules, are introduced as a model
of medical reasoning. Interestingly, from the set-theoretical point of view, the sets of
examples supporting both rules correspond to the lower and upper approxima-
tions in rough sets [?]. On the other hand, from the viewpoint of propositional
logic, both positive and negative rules are defined as classical propositions, or
deterministic rules with two probabilistic measures, classification accuracy and
coverage. Second, two algorithms for induction of positive and negative rules are
introduced, defined as search procedures that use accuracy and coverage as eval-
uation indices. Finally, the proposed method was evaluated on several medical
databases, the experimental results of which show that the induced rules correctly
represented experts' knowledge and that several interesting patterns were discovered.



2 Focusing Mechanism 

One of the characteristics of medical reasoning is a focusing mechanism, which
is used to select the final diagnosis from many candidates [?]. For example,
in the differential diagnosis of headache, more than 60 diseases are checked by
present history, physical examinations, and laboratory examinations. In diagnos-
tic procedures, a candidate is excluded if a symptom necessary for its diagnosis is
not observed.

This style of reasoning consists of the following two kinds of reasoning pro-
cesses: exclusive reasoning and inclusive reasoning.^2 The diagnostic procedure
proceeds as follows: first, exclusive reasoning excludes a disease from the candi-
dates when a patient does not have a symptom which is necessary to diagnose
that disease. Second, inclusive reasoning suspects a disease in the output of
the exclusive process when a patient has symptoms specific to that disease. These
two steps are modelled as the usage of two kinds of rules, negative rules (or ex-
clusive rules) and positive rules, the former of which corresponds to exclusive
reasoning and the latter of which corresponds to inclusive reasoning. In the next
two subsections, these two rules are represented as special kinds of probabilistic
rules.
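The two-step focusing mechanism can be sketched as a pair of filters. This is only an illustration: the function name `differential_diagnosis` is hypothetical, and the entries of `necessary` and `specific` are toy values chosen for the example, not clinical knowledge taken from the paper's rule base.

```python
def differential_diagnosis(findings, necessary, specific):
    """Exclusive reasoning followed by inclusive reasoning.

    findings:  set of symptoms observed in the patient
    necessary: disease -> set of symptoms that must all be present
    specific:  disease -> set of symptoms, any of which raises suspicion
    """
    # Exclusive reasoning: drop a disease when a necessary symptom is absent.
    remaining = [d for d, needed in necessary.items() if needed <= findings]
    # Inclusive reasoning: keep the diseases with at least one specific symptom.
    return [d for d in remaining if specific.get(d, set()) & findings]


# Toy knowledge (hypothetical values for illustration only).
necessary = {"migraine": {"throbbing pain"}, "m.c.h.": {"tenderness of M1"}}
specific = {"migraine": {"nausea"}, "m.c.h.": {"tenderness of M1"}}

suspected = differential_diagnosis(
    {"tenderness of M1", "persistent pain"}, necessary, specific
)
```

In this toy run migraine is excluded because throbbing pain is absent, and only m.c.h. survives both steps.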



^2 The relations between this diagnostic model and another diagnostic model are discussed in [?].
3 Definition of Rules

3.1 Rough Sets

In the following sections, we use the notations introduced by Grzymala-
Busse and Skowron [?], which are based on rough set theory [?]. These notations
are illustrated by a small database shown in Table 1, collecting patients who
complained of headache.



Table 1. An example of database

No.  age    location  nature      prodrome  nausea  M1   class
1    50-59  occular   persistent  no        no      yes  m.c.h.
2    40-49  whole     persistent  no        no      yes  m.c.h.
3    40-49  lateral   throbbing   no        yes     no   migra
4    40-49  whole     throbbing   yes       yes     no   migra
5    40-49  whole     radiating   no        no      yes  m.c.h.
6    50-59  whole     persistent  no        yes     yes  psycho

Definitions. M1: tenderness of M1, m.c.h.: muscle
contraction headache, migra: migraine, psycho:
psychological pain.



Let U denote a nonempty, finite set called the universe and A denote a
nonempty, finite set of attributes, i.e., $a : U \to V_a$ for $a \in A$, where $V_a$ is called
the domain of a. Then, a decision table is defined as an information
system, $\mathcal{A} = (U, A \cup \{d\})$. For example, Table 1 is an information system with
U = {1, 2, 3, 4, 5, 6}, A = {age, location, nature, prodrome, nausea, M1}, and
d = class. For location $\in A$, $V_{location}$ is defined as {occular, lateral, whole}.

The atomic formulae over $B \subseteq A \cup \{d\}$ and V are expressions of the form
[a = v], called descriptors over B, where $a \in B$ and $v \in V_a$. The set F(B, V) of
formulas over B is the least set containing all atomic formulas over B and closed
with respect to disjunction, conjunction, and negation. For example, [location =
occular] is a descriptor of B.

For each $f \in F(B, V)$, $f_A$ denotes the meaning of f in A, i.e., the set of all
objects in U with property f, defined inductively as follows.

1. If f is of the form [a = v], then $f_A = \{s \in U \mid a(s) = v\}$.

2. $(f \wedge g)_A = f_A \cap g_A$; $(f \vee g)_A = f_A \cup g_A$; $(\neg f)_A = U - f_A$.

For example, for f = [location = whole], $f_A$ = {2, 4, 5, 6}. As an example of a
conjunctive formula, g = [location = whole] ∧ [nausea = no] is a descriptor of
U, and $g_A$ is equal to {2, 5}.
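The meaning of a formula can be computed directly from Table 1, as in the following sketch. The representation of the table as a dictionary and the helper names `descriptor`, `conj`, `disj`, and `neg` are choices made for this illustration only.

```python
COLUMNS = ("age", "location", "nature", "prodrome", "nausea", "M1", "class")
# Table 1, one tuple per object of the universe U.
TABLE1 = {
    1: ("50-59", "occular", "persistent", "no", "no", "yes", "m.c.h."),
    2: ("40-49", "whole", "persistent", "no", "no", "yes", "m.c.h."),
    3: ("40-49", "lateral", "throbbing", "no", "yes", "no", "migra"),
    4: ("40-49", "whole", "throbbing", "yes", "yes", "no", "migra"),
    5: ("40-49", "whole", "radiating", "no", "no", "yes", "m.c.h."),
    6: ("50-59", "whole", "persistent", "no", "yes", "yes", "psycho"),
}


def descriptor(attribute, value):
    """Meaning of the atomic formula [attribute = value]: the objects satisfying it."""
    index = COLUMNS.index(attribute)
    return {s for s, row in TABLE1.items() if row[index] == value}


def conj(f, g):
    """Meaning of a conjunction: intersection of the two meanings."""
    return f & g


def disj(f, g):
    """Meaning of a disjunction: union of the two meanings."""
    return f | g


def neg(f):
    """Meaning of a negation: complement with respect to the universe U."""
    return set(TABLE1) - f


f_A = descriptor("location", "whole")
g_A = conj(descriptor("location", "whole"), descriptor("nausea", "no"))
```

Representing each meaning as a Python set makes the three inductive clauses one-line set operations.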
3.2 Classification Accuracy and Coverage

Definition of Accuracy and Coverage. By the use of the framework above,
classification accuracy and coverage, or true positive rate, are defined as follows.

Definition 1.

Let R and D denote a formula in F(B, V) and the set of objects which belong to
a decision d, respectively. Classification accuracy and coverage (true positive rate) for $R \to d$
are defined as:

$$ \alpha_R(D) = \frac{|R_A \cap D|}{|R_A|} \; (= P(D|R)), \quad \text{and} \quad \kappa_R(D) = \frac{|R_A \cap D|}{|D|} \; (= P(R|D)), $$

where $|S|$, $\alpha_R(D)$, $\kappa_R(D)$ and P(S) denote the cardinality of a set S, the classifi-
cation accuracy of R as to classification of D, the coverage (the true positive rate
of R to D), and the probability of S, respectively.

In the above example, when R and D are set to [nausea = yes] and [class =
migraine], $\alpha_R(D)$ = 2/3 = 0.67 and $\kappa_R(D)$ = 2/2 = 1.0.

It is notable that $\alpha_R(D)$ measures the degree of sufficiency of the proposi-
tion $R \to D$, and that $\kappa_R(D)$ measures the degree of its necessity. For example,
if $\alpha_R(D)$ is equal to 1.0, then $R \to D$ is true. On the other hand, if $\kappa_R(D)$ is
equal to 1.0, then $D \to R$ is true. Thus, if both measures are 1.0, then $R \leftrightarrow D$.
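The two measures reduce to set cardinalities, as the following sketch shows for the nausea and migraine example. The function name `accuracy_and_coverage` is hypothetical, and only the nausea and class columns of Table 1 are reproduced here.

```python
# nausea and class columns of Table 1, indexed by object number.
NAUSEA = {1: "no", 2: "no", 3: "yes", 4: "yes", 5: "no", 6: "yes"}
CLASS = {1: "m.c.h.", 2: "m.c.h.", 3: "migra", 4: "migra", 5: "m.c.h.", 6: "psycho"}


def accuracy_and_coverage(r_set, d_set):
    """Return (accuracy, coverage) = (|R n D| / |R|, |R n D| / |D|)."""
    overlap = len(r_set & d_set)
    return overlap / len(r_set), overlap / len(d_set)


R = {s for s, v in NAUSEA.items() if v == "yes"}   # meaning of [nausea = yes]
D = {s for s, v in CLASS.items() if v == "migra"}  # objects of class migraine
alpha, kappa = accuracy_and_coverage(R, D)
```

Here `alpha` is the fraction of nausea cases that are migraine (sufficiency) and `kappa` is the fraction of migraine cases that show nausea (necessity).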

3.3 Probabilistic Rules 

By the use of accuracy and coverage, a probabilistic rule is defined as:

$$ R \xrightarrow{\alpha, \kappa} d \quad \text{s.t.} \quad R = \wedge_j [a_j = v_k], \; \alpha_R(D) \geq \delta_\alpha \; \text{and} \; \kappa_R(D) \geq \delta_\kappa. $$

This rule is a kind of probabilistic proposition with two statistical measures,
which is an extension of Ziarko's variable precision rough set model (VPRS) [?].^3

It is also notable that both a positive rule and a negative rule are defined as
special cases of this rule, as shown in the next subsections.

3.4 Positive Rules 

A positive rule is defined as a rule supported only by positive examples, the clas-
sification accuracy of which is equal to 1.0. It is notable that the set supporting
this rule corresponds to a subset of the lower approximation of a target concept,
which is introduced in rough sets [?]. Thus, a positive rule is represented as:

$$ R \to d \quad \text{s.t.} \quad R = \wedge_j [a_j = v_k], \; \alpha_R(D) = 1.0. $$



^3 This probabilistic rule is also a kind of Rough Modus Ponens [?].
In the above example, one positive rule of "m.c.h." (muscle contraction headache)
is:

[nausea = no] → m.c.h.  α = 3/3 = 1.0.

This positive rule is often called a deterministic rule. However, in this paper,
we use the term positive (deterministic) rule, because a deterministic rule which
is supported only by negative examples, called a negative rule, is introduced
in the next subsection.

3.5 Negative Rules 

Before defining a negative rule, let us first introduce an exclusive rule, the con-
trapositive of a negative rule [?]. An exclusive rule is defined as a rule supported
by all the positive examples, the coverage of which is equal to 1.0.^4 It is notable
that the set supporting an exclusive rule corresponds to the upper approxima-
tion of a target concept, which is introduced in rough sets [?]. Thus, an exclusive
rule is represented as:

$$ R \to d \quad \text{s.t.} \quad R = \vee_j [a_j = v_k], \; \kappa_R(D) = 1.0. $$

In the above example, the exclusive rule of "m.c.h." is:

[M1 = yes] ∨ [nausea = no] → m.c.h.  κ = 1.0.

From the viewpoint of propositional logic, an exclusive rule should be represented
as:

$$ d \to \vee_j [a_j = v_k], $$

because the condition of an exclusive rule corresponds to the necessity condition
of the conclusion d. Thus, it is easy to see that a negative rule is defined as the
contrapositive of an exclusive rule:

$$ \wedge_j \neg [a_j = v_k] \to \neg d, $$

which means that if a case does not satisfy any attribute-value pair in the
condition of a negative rule, then we can exclude the decision d from the candidates.
For example, the negative rule of m.c.h. is:

¬[M1 = yes] ∧ ¬[nausea = no] → ¬m.c.h.

In summary, a negative rule is defined as:

$$ \wedge_j \neg [a_j = v_k] \to \neg d \quad \text{s.t.} \quad \forall [a_j = v_k] \;\; \kappa_{[a_j = v_k]}(D) = 1.0, $$

where D denotes the set of samples which belong to a class d.

Negative rules should also be included in the category of deterministic rules,
since their coverage, a measure of negative concepts, is equal to 1.0. It is also no-
table that the set supporting a negative rule corresponds to a subset of the negative
region, which is introduced in rough sets [?].

^4 An exclusive rule represents the necessity condition of a decision.
4 Algorithms for Rule Induction

An exclusive rule, the contrapositive of a negative rule, is induced
by a modification of the algorithm introduced in PRIMEROSE-REX [?],
as shown in Fig. 1. This algorithm works as follows. (1) First, it selects a
descriptor $[a_i = v_j]$ from the list of attribute-value pairs, denoted by L. (2)
Then, it checks whether this descriptor overlaps with the set of positive examples,
denoted by D. (3) If so, this descriptor is included in a list of candidates for
positive rules, and the algorithm checks whether its coverage is equal to 1.0.
If the coverage is equal to 1.0, then this descriptor is added to $R_{er}$, the
formula for the conditional part of the exclusive rule of D. (4) Then, $[a_i = v_j]$ is
deleted from the list L. This procedure, from (1) to (4), continues until L is
empty. (5) Finally, when L is empty, the algorithm generates negative rules by
taking the contrapositive of the induced exclusive rules.

On the other hand, positive rules are induced as inclusive rules by the algo-
rithm introduced in PRIMEROSE-REX [?], as shown in Fig. 2. For induction
of positive rules, the thresholds of accuracy and coverage are set to 1.0 and 0.0,
respectively.

This algorithm works in the following way. (1) First, it substitutes $L_1$, which
denotes a list of formulae composed of only one descriptor, with the list $L_{ir}$ gener-
ated by the former algorithm shown in Fig. 1. (2) Then, until $L_1$ becomes empty,
the following procedure continues: (a) A formula $[a_i = v_j]$ is removed from
$L_1$. (b) Then, the algorithm checks whether $\alpha_R(D)$ is at least the threshold.
(For induction of positive rules, this is equal to checking whether $\alpha_R(D)$
is equal to 1.0.) If so, this formula is included in the list of conditional
parts of positive rules. Otherwise, it is included in M, which is used for
making conjunctions. (3) When $L_1$ is empty, the next list $L_2$ is generated from
the list M.

5 Experimental Results 

For experimental evaluation, a new system, called PRIMEROSE-REX2 (Prob-
abilistic Rule Induction Method for Rules of Expert System ver 2.0), was devel-
oped, in which the algorithms discussed in Sect. 4 were implemented.

PRIMEROSE-REX2 was applied to the following three medical domains:
headache (RHINOS domain), whose training set consists of 52119 samples,
45 classes and 147 attributes; cerebrovascular diseases (CVD), whose training
set consists of 7620 samples, 22 classes and 285 attributes; and meningitis,
whose training set consists of 1211 samples, 4 classes and 41 attributes
(Table 2).

For evaluation, we used the following two types of experiments. One ex-
periment was to evaluate the predictive accuracy by using the cross-validation
method, which is often used in the machine learning literature [2]. The other ex-
periment was to have medical experts evaluate the induced rules and to check whether
these rules led to a new discovery.
procedure Exclusive and Negative Rules;
var
  L : List;  /* A list of elementary attribute-value pairs */
begin
  L := P_0;  /* P_0: a list of elementary attribute-value pairs given in a database */
  while (L ≠ {}) do
  begin
    Select one pair [a_i = v_j] from L;
    if ([a_i = v_j]_A ∩ D ≠ ∅) then do  /* D: positive examples of a target class d */
    begin
      L_ir := L_ir + [a_i = v_j];  /* Candidates for Positive Rules */
      if (κ_[a_i = v_j](D) = 1.0)
        then R_er := R_er ∧ [a_i = v_j];
        /* Include [a_i = v_j] into the formula of Exclusive Rule */
    end
    L := L − [a_i = v_j];
  end
  Construct Negative Rules:
    Take the contrapositive of R_er.
end {Exclusive and Negative Rules};

Fig. 1. Induction of exclusive and negative rules 
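The procedure of Fig. 1 can be sketched in Python on the data of Table 1. This is an illustrative reading of the figure, not the PRIMEROSE-REX2 implementation: the names `support` and `induce_exclusive` are hypothetical, and the exclusive rule is kept as a list of descriptors whose coverage is 1.0, so that its contrapositive is the conjunction of their negations.

```python
ATTRS = ("age", "location", "nature", "prodrome", "nausea", "M1")
# Table 1: attribute values followed by the class label.
ROWS = {
    1: ("50-59", "occular", "persistent", "no", "no", "yes", "m.c.h."),
    2: ("40-49", "whole", "persistent", "no", "no", "yes", "m.c.h."),
    3: ("40-49", "lateral", "throbbing", "no", "yes", "no", "migra"),
    4: ("40-49", "whole", "throbbing", "yes", "yes", "no", "migra"),
    5: ("40-49", "whole", "radiating", "no", "no", "yes", "m.c.h."),
    6: ("50-59", "whole", "persistent", "no", "yes", "yes", "psycho"),
}


def support(attr, value):
    """Objects satisfying the descriptor [attr = value]."""
    index = ATTRS.index(attr)
    return {s for s, row in ROWS.items() if row[index] == value}


def induce_exclusive(target):
    """Return (candidates for positive rules, descriptors of the exclusive rule)."""
    d_set = {s for s, row in ROWS.items() if row[-1] == target}
    pairs = sorted({(a, row[i]) for row in ROWS.values() for i, a in enumerate(ATTRS)})
    candidates, exclusive = [], []
    for attr, value in pairs:                 # steps (1) and (4): visit each pair once
        overlap = support(attr, value) & d_set
        if overlap:                           # step (2): overlaps the positive examples
            candidates.append((attr, value))  # step (3): candidate for positive rules
            if len(overlap) == len(d_set):    # coverage equal to 1.0
                exclusive.append((attr, value))
    return candidates, exclusive


candidates, exclusive = induce_exclusive("m.c.h.")
# Step (5): the negative rule is the contrapositive of the exclusive rule.
negative_rule = " ∧ ".join(f"¬[{a} = {v}]" for a, v in exclusive) + " → ¬m.c.h."
```

On Table 1 this sketch finds [M1 = yes] and [nausea = no], the two descriptors of the example in Sect. 3.5, and also [prodrome = no], whose coverage for m.c.h. is likewise 1.0.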



Table 2. Databases

Domain      Samples  Classes  Attributes
Headache    52119    45       147
CVD         7620     22       285
Meningitis  1211     4        41



5.1 Performance of Rules Obtained 

For comparison of performance, the experiments were performed by the follow-
ing four procedures. First, rules were acquired manually from experts. Second,
the datasets were randomly split into new training samples and new test sam-
ples. Third, PRIMEROSE-REX2 and the conventional rule induction methods AQ15
[?] and C4.5 [2] were applied to the new training samples for rule generation.
Fourth, the induced rules and the rules acquired from experts were tested on the new
test samples. The second to fourth steps were repeated 100 times, and the
classification accuracy was averaged over the 100 trials. This process is a variant of repeated
2-fold cross-validation, introduced in [?].
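The evaluation loop can be sketched as follows. This is a generic outline under stated assumptions: the split is taken to be into two halves (the text does not give the proportion), and `induce` and `classify` are placeholder callables standing in for any of the compared methods. The name `repeated_split_accuracy` and the majority-class learner in the usage lines are hypothetical.

```python
import random


def repeated_split_accuracy(samples, induce, classify, trials=100, seed=0):
    """Average test accuracy over repeated random splits into two halves.

    samples:  list of (features, label) pairs
    induce:   function(training samples) -> model
    classify: function(model, features) -> predicted label
    """
    rng = random.Random(seed)
    scores = []
    for _ in range(trials):
        shuffled = samples[:]
        rng.shuffle(shuffled)
        half = len(shuffled) // 2  # assumption: equal-sized training and test sets
        train, test = shuffled[:half], shuffled[half:]
        model = induce(train)
        correct = sum(classify(model, x) == label for x, label in test)
        scores.append(correct / len(test))
    return sum(scores) / len(scores)


def induce_majority(train):
    """Toy learner: remember the most frequent class of the training set."""
    labels = [label for _, label in train]
    return max(set(labels), key=labels.count)


toy = [((i,), "a") for i in range(10)]
accuracy = repeated_split_accuracy(toy, induce_majority, lambda model, x: model)
```

Fixing the seed makes the 100 splits reproducible, which is convenient when several methods are compared on the same partitions.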

Experimental results (performance) are shown in Table 3. The first and sec-
ond rows show the results obtained by using PRIMEROSE-REX2: the results in the
first row were derived by using both positive and negative rules, and those in the
second row were derived by using only positive rules. The third row shows the results
procedure Positive Rules;
var
  i : integer; M, L_i : List;
begin
  L_1 := L_ir;
  /* L_ir: A list of candidates generated by induction of exclusive rules */
  i := 1; M := {};
  for i := 1 to n do
  /* n: Total number of attributes given in a database */
  begin
    while (L_i ≠ {}) do
    begin
      Select one pair R = ∧[a_i = v_j] from L_i;
      L_i := L_i − {R};
      if (α_R(D) ≥ δ_α)
        then do S_ir := S_ir + {R};
        /* Include R in a list of the Positive Rules */
        else M := M + {R};
    end
    L_{i+1} := (A list of the whole combination of the conjunction formulae in M);
  end
end {Positive Rules};



Fig. 2. Induction of positive rules 
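The level-wise search of Fig. 2 can be sketched on Table 1 as follows. This is an illustrative reading rather than the original implementation: formulas are represented as frozensets of (attribute, value) pairs, the next level is built by joining two rejected formulas that differ in exactly one descriptor, and conjunctions that no longer overlap the positive examples are dropped. The names `support` and `induce_positive` are hypothetical.

```python
from itertools import combinations

ATTRS = ("age", "location", "nature", "prodrome", "nausea", "M1")
# Table 1: attribute values followed by the class label.
ROWS = {
    1: ("50-59", "occular", "persistent", "no", "no", "yes", "m.c.h."),
    2: ("40-49", "whole", "persistent", "no", "no", "yes", "m.c.h."),
    3: ("40-49", "lateral", "throbbing", "no", "yes", "no", "migra"),
    4: ("40-49", "whole", "throbbing", "yes", "yes", "no", "migra"),
    5: ("40-49", "whole", "radiating", "no", "no", "yes", "m.c.h."),
    6: ("50-59", "whole", "persistent", "no", "yes", "yes", "psycho"),
}


def support(formula):
    """Objects satisfying every descriptor of a conjunctive formula."""
    return {s for s, row in ROWS.items()
            if all(row[ATTRS.index(a)] == v for a, v in formula)}


def induce_positive(target, delta_alpha=1.0):
    """Level-wise search for conjunctions whose accuracy reaches delta_alpha."""
    d_set = {s for s, row in ROWS.items() if row[-1] == target}
    pairs = sorted({(a, row[i]) for row in ROWS.values() for i, a in enumerate(ATTRS)})
    # L_1: single descriptors that overlap the positive examples (L_ir of Fig. 1).
    level = [frozenset([p]) for p in pairs if support(frozenset([p])) & d_set]
    rules = []
    for _ in range(len(ATTRS)):
        rejected = []  # M: formulas kept for making conjunctions
        for formula in level:
            covered = support(formula)
            if len(covered & d_set) / len(covered) >= delta_alpha:
                rules.append(formula)
            else:
                rejected.append(formula)
        next_level = set()
        for f, g in combinations(rejected, 2):
            h = f | g
            attrs = [a for a, _ in h]
            # Keep h only if it adds one descriptor, uses each attribute once,
            # and still overlaps the positive examples.
            if len(h) == len(f) + 1 and len(set(attrs)) == len(attrs) and support(h) & d_set:
                next_level.add(h)
        level = sorted(next_level, key=sorted)
        if not level:
            break
    return rules


rules = induce_positive("m.c.h.")
```

For m.c.h. the first level already yields [nausea = no], the positive rule quoted in Sect. 3.4, and the second level adds conjunctions such as [age = 40-49] ∧ [nature = persistent].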



derived from medical experts. For comparison, the classification accuracies
of C4.5 and AQ15 are shown in the fourth and fifth rows. These



Table 3. Experimental results (accuracy: averaged)

Method                               Headache  CVD    Meningitis
PRIMEROSE-REX2 (Positive+Negative)   91.3%     89.3%  92.5%
PRIMEROSE-REX2 (Positive)            68.3%     71.3%  74.5%
Experts                              95.0%     92.9%  93.2%
C4.5                                 85.8%     79.7%  81.4%
AQ15                                 86.2%     78.9%  82.5%



results show that the combination of positive and negative rules outperforms 
positive rules, although it is a little worse than medical experts’ rules. 
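
The procedure of Fig. 2 can be sketched in Python as below. This is one reading of the figure under stated assumptions, not the authors' code: rows are dictionaries with a `"class"` key, a rule is a set of (attribute, value) pairs, α_R(D) is taken to be the fraction of matching rows that belong to the target class, and "the whole combination of the conjunction formulae in M" is interpreted as all pairwise unions of rejected formulae that yield a conjunction one pair longer with no attribute repeated.

```python
from itertools import combinations


def accuracy(rule, rows, target):
    """alpha_R(D): fraction of rows matching `rule` that belong to class `target`."""
    matched = [r for r in rows if all(r[a] == v for a, v in rule)]
    if not matched:
        return 0.0
    return sum(r["class"] == target for r in matched) / len(matched)


def induce_positive_rules(rows, target, candidates, delta_alpha, n_attributes):
    """Sketch of Fig. 2: grow conjunctions until their accuracy exceeds delta_alpha."""
    level = [frozenset([c]) for c in candidates]     # L_1 (from exclusive rules)
    positive, rejected = [], set()                   # S_ir and M
    for i in range(1, n_attributes + 1):
        for rule in level:
            if accuracy(rule, rows, target) > delta_alpha:
                positive.append(rule)                # accept as positive rule
            else:
                rejected.add(rule)                   # keep for later combination
        # L_{i+1}: conjunctions with i+1 pairs built from rejected formulae
        next_level = set()
        for r1, r2 in combinations(rejected, 2):
            merged = r1 | r2
            attrs = [a for a, _ in merged]
            if len(merged) == i + 1 and len(set(attrs)) == len(attrs):
                next_level.add(merged)
        level = list(next_level)
    return positive
```

Because only rejected formulae are combined, a conjunction is never extended once it already passes the accuracy threshold.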

5.2 What Is Discovered? 

Positive Rules in Meningitis. In the domain of meningitis, the following
positive rules, which medical experts did not expect, were obtained.

[WBC < 12000] ∧ [Sex = Female] ∧ [Age < 40] ∧ [CSF_CELL < 1000] → Virus
[Age > 40] ∧ [WBC > 8000] ∧ [Sex = Male] ∧ [CSF_CELL > 1000] → Bacteria

The former rule means that if WBC (White Blood Cell Count) is less than
12000, the Sex of the patient is Female, the Age is less than 40, and CSF_CELL
(Cell count of Cerebrospinal Fluid) is less than 1000, then the type of
meningitis is Virus. The latter rule means that if the Age of the patient is
greater than 40, WBC is larger than 8000, the Sex is Male, and CSF_CELL is
larger than 1000, then the type of meningitis is Bacteria.
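
To make the reading of the two rules concrete, they can be written as a small classifier. This is only an illustrative sketch: the dictionary keys and the function name are assumptions, and the strict inequalities are taken from the rules as printed.

```python
def classify_meningitis(patient):
    """Apply the two discovered positive rules; return None if neither rule fires."""
    if (patient["WBC"] < 12000 and patient["Sex"] == "Female"
            and patient["Age"] < 40 and patient["CSF_CELL"] < 1000):
        return "Virus"
    if (patient["Age"] > 40 and patient["WBC"] > 8000
            and patient["Sex"] == "Male" and patient["CSF_CELL"] > 1000):
        return "Bacteria"
    return None   # the two rules do not cover this case
```

Cases covered by neither rule are left undecided, since the two rules are not a complete classifier.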

The most interesting point is that these rules include information about
age and sex, which are often regarded as unimportant attributes for the differential
diagnosis of meningitis. The first discovery was that women suffered from
bacterial infection less often than men; such a relationship between
sex and meningitis has not been discussed in the medical context. On close
examination of the meningitis database, it was found that most of the above
patients suffered from chronic diseases, such as DM, LC, and sinusitis, which are
risk factors for bacterial meningitis. The second discovery was that [Age < 40]
was also an important factor for suspecting viral rather than bacterial meningitis,
which matches the fact that most old people suffer from chronic diseases.

These results were also re-evaluated in medical practice. Recently, the above
two rules were checked against an additional 21 patients who suffered from meningitis
(15 cases of viral and 6 cases of bacterial meningitis). Surprisingly, the above rules
misclassified only three cases (two viral and one bacterial); that is,
the total accuracy was 18/21 = 85.7% and the accuracies for viral and
bacterial meningitis were 13/15 = 86.7% and 5/6 = 83.3%, respectively. The reasons
for the misclassifications were the following: the bacterial case was a patient
who had a severe immunodeficiency, although he was very young. The two cases of
viral infection were patients who also suffered from herpes zoster. It is notable
that even the misclassified cases could be explained from the viewpoint of
immunodeficiency: that is, it was confirmed that immunodeficiency is a key
factor in meningitis.

Validation of these rules is still ongoing and will be reported in the
near future.

Positive and Negative Rules in CVD. Concerning the database on CVD,
several interesting rules were derived. The most interesting results were the
following positive and negative rules for thalamus hemorrhage:

[Sex = Female] ∧ [Hemiparesis = Left] ∧ [LOC: positive] → Thalamus
¬[Risk: Hypertension] ∧ ¬[Sensory = no] → ¬Thalamus

The former rule means that if the Sex of a patient is female and the patient
suffered from left hemiparesis ([Hemiparesis = Left]) and loss of consciousness
([LOC: positive]), then the focus of CVD is Thalamus. The latter rule means
that if the patient neither suffers from hypertension ([Risk: Hypertension]) nor
from sensory disturbance ([Sensory = no]), then the focus of CVD is not Thalamus.

Interestingly, LOC (loss of consciousness) under the condition of [Sex =
Female] ∧ [Hemiparesis = Left] was found to be an important factor in
diagnosing thalamic damage. In this domain, no strong correlations between these
attributes and others, such as those found in the meningitis database, have been found yet.
Finding the factor behind these rules is left for future work.



5.3 Rule Discovery as Knowledge Acquisition 

Expert System: RH. Another aim of rule discovery is automated knowledge
acquisition from databases. Knowledge acquisition is referred to as a bottleneck
problem in the development of expert systems, which has not been fully
solved and is expected to be solved by induction of rules from databases. However,
there are few papers that discuss the evaluation of discovered rules from
the viewpoint of knowledge acquisition.

For this purpose, we have developed an expert system called RH (Rule-based
system for Headache) by using the acquired knowledge. RH works in two
stages. First, RH requires inputs and applies exclusive and negative rules to
select candidates (focusing mechanism). Then, RH requires additional inputs
and applies positive rules for differential diagnosis between the selected candidates.
Finally, RH outputs diagnostic conclusions.
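
The two-stage inference described above might be organised as in the following sketch. It is purely illustrative and not the RH implementation: the function name, the representation of rules as predicates over a dictionary of findings, and the merging of exclusive and negative rules into a single "ruled out" test per diagnosis are all assumptions.

```python
def rh_diagnose(first_inputs, more_inputs, screening_rules, positive_rules):
    """Two-stage inference: focus on candidates, then differentiate among them.

    screening_rules: diagnosis -> predicate that is True when the diagnosis is
                     ruled out by the first inputs (hypothetical representation).
    positive_rules:  diagnosis -> predicate that is True when the diagnosis is
                     supported by all findings (hypothetical representation).
    """
    # Stage 1 (focusing mechanism): keep every diagnosis that is not ruled out.
    candidates = [d for d, ruled_out in screening_rules.items()
                  if not ruled_out(first_inputs)]
    # Stage 2: apply positive rules to the surviving candidates only.
    findings = {**first_inputs, **more_inputs}
    return [d for d in candidates if positive_rules[d](findings)]
```

The point of the first stage is that additional inputs only need to be requested for diagnoses that survive the screening.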

Evaluation of RH. RH was evaluated in clinical practice with respect to its
classification accuracy, using 930 patients who came to the outpatient clinic
after the development of this system. Experimental results for classification
accuracy are shown in Table 4. The first and second rows show the performance
of rules obtained by using PRIMEROSE-REX2: the results in the first row were
derived by using both positive and negative rules, and those in the second row
by using only positive rules. The third and fourth rows show the corresponding
results for rules acquired directly from medical experts: both positive and
negative rules in the third row, and only positive rules in the fourth. These
results show that the combination of positive and negative rules outperforms
positive rules alone and achieves almost the same performance as the experts' rules.


Table 4. Evaluation of RH (accuracy: averaged)

Method                                     Accuracy
PRIMEROSE-REX2 (Positive and Negative)     91.4% (851/930)
PRIMEROSE-REX2 (Positive)                  78.5% (729/930)
RHINOS (Positive and Negative)             93.5% (864/930)
RHINOS (Positive)                          82.8% (765/930)


Footnote. The reason why we selected the domain of headache is that we formerly
developed an expert system, RHINOS (Rule-based Headache INformation Organizing
System), which makes a differential diagnosis of headache. For that system, it took
about six months to acquire knowledge from domain experts.

6 Conclusions

In this paper, the characteristics of two measures, classification accuracy and
coverage, are discussed. The discussion shows that the two measures are dual and that
accuracy and coverage are measures of positive and negative rules, respectively.
Then, an algorithm for the induction of positive and negative rules is introduced.
The proposed method was evaluated on medical databases. The experimental
results show that the induced rules correctly represented experts' knowledge
and that several interesting patterns were discovered.
Abstract. Association rules have been popular in theory, though it is
unclear how much success they have had in practice. Very many associa- 
tion rules are found in any application by any approach and they require 
effective pruning and filtering. There has been much research in this 
area recently, but less with the goal of providing a global overview and 
summary of all rules, which may then be used to explore the rules and 
to evaluate their worth. The unusual feature of association rules is that 
those with the highest objective values for the two key criteria (support 
and confidence) are not usually those with the most subjective interest 
(because we know the obvious results already). The TwoKey plot is a way 
of displaying all discovered association rules at once, while also providing 
the means to review and manage them. It is a powerful tool for getting
a first overview of the distribution of confidence and support. Features
such as separate groups of rules or outliers are detected immediately. By 
exploiting various ancestor relationship structures among the rules, we 
can use the TwoKey Plot also as a visual assessment tool, closely related 
to pruning methods - e.g. those proposed by Bing Liu (1999). The con- 
cept will be illustrated using the interactive software MARC (Multiple 
Association Rules Control). 



1 Introduction 

Analysis by association rules is one of a number of techniques in the field of Data 
Mining, the study of very large data sets. Simple methods like Association Rules 
are popular because they can be applied to such large data sets, while some 
more traditional statistical methods do not scale up. It is typical of Data Mining 
analyses that very many results are produced, not only through the application 
of a single method but also because several different techniques may be applied. 
Tools for organising and managing large numbers of results are very necessary 
and some of the ideas in this paper will apply to results from other Data Mining 
approaches as well. 

Association rules were introduced formally by Agrawal et al (1993) and have 
been discussed widely since (see Bruzzese & Davino (2001) or Hofmann & Wil- 
helm (2001) for recent references). They are a simple approach to analysing the 



L. De Raedt and A. Siebes (Eds.): PKDD 2001, LNAI 2168, pp. 472-^^^ 2001. 
© Springer- Verlag Berlin Heidelberg 2001 



The TwoKey Plot for Multiple Association Rules Control 473 



association between large numbers of categorical variables and part of their at- 
traction is that the method is easy to explain and that individual results are 
readily understandable. What is not so easy to deal with is that for any realistic 
data set there are two main problems: firstly, that far too many association rules 
tend to be reported and most software packages just display them all in a fairly 
incomprehensible long list (see Table 1 for the first lines from a typical example);
secondly, that the obvious way round the first problem, "to just report the 'best'
rules discovered", doesn't work. Association rules that are best on the two key
criteria of support and confidence are likely to be trivial or well known already.
The most interesting rules will tend to be ones that have good, but not outstanding,
values for support and confidence. Unfortunately, there will usually be
a large number of these. There has been research on pruning and filtering sets 
of rules, but while these methods look promising they still leave many rules to 
be evaluated. Background information has to be brought into play and it makes 
sense to try to concentrate on discussing a small group of rules. Some of these 
will have high criteria values, but be of no practical application. What remains 
will ideally be of value to the data set owners. 



Table 1. Sample output from association rules software. (Only the first four rules
found are shown; there are many more to follow!)

age>33 & weeks worked in year>8 -> income=50000+
    [Cov=0.307 (91870); Sup=0.051 (15308); Conf=0.167; Lift=2.69]
age>33 & wage/hour<0 & weeks worked in year>8 -> income=50000+
    [Cov=0.277 (82808); Sup=0.049 (14634); Conf=0.177; Lift=2.85]
age>33 & hispanic origin=All other & weeks worked in year>8 -> income=50000+
    [Cov=0.276 (82712); Sup=0.049 (14585); Conf=0.176; Lift=2.84]
age>33 & wage/hour<0 & hispanic origin=All other & weeks worked in year>8 -> income=50000+
    [Cov=0.248 (74276); Sup=0.047 (13931); Conf=0.188; Lift=3.02]



2 Analysis by Association Rules 

It is assumed that the data set comprises 0/1 variables reflecting the absence
or presence of attributes in transactions. For non-binary data we will assume
an appropriate recoding. An association rule [Agrawal et al., 1993] is defined as
an implication of the form X → Y, where X and Y are mutually exclusive
itemsets. An association rule X → Y holds with confidence c = c(X, Y), if c%
of transactions in the data set that contain X also contain Y. X → Y has
support s = s(X, Y) in the data set, if s% of transactions contain X and Y.
Confidence is equivalent to the conditional probability of Y given X. Support
is equivalent to the joint probability of X and Y. In this paper we shall restrict
ourselves to rules with single item conclusions (i.e. the itemset Y on the Right
Hand Side always contains just one item) for expository purposes. Simple rules
with just one assumption (when itemset X on the Left Hand Side has only one
member) will be referred to as level two rules, because two items are involved.
By extension, a rule with (m-1) assumptions on the LHS will be a level m rule.
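
The definitions of support, confidence and level can be illustrated with a short sketch. The representation of transactions as Python sets of item names and the function names are assumptions made for this example; values are returned as fractions rather than percentages.

```python
def support(transactions, items):
    """Fraction of transactions that contain every item in `items`."""
    items = set(items)
    return sum(items <= t for t in transactions) / len(transactions)


def confidence(transactions, lhs, rhs):
    """Estimate of P(rhs | lhs); returns 0.0 if the LHS never occurs."""
    s_lhs = support(transactions, lhs)
    if s_lhs == 0:
        return 0.0
    return support(transactions, set(lhs) | set(rhs)) / s_lhs


def level(lhs, rhs):
    """Number of items involved in the rule lhs -> rhs."""
    return len(set(lhs)) + len(set(rhs))
```

For example, with the four transactions {a, b}, {a}, {b}, {a, b, c}, the rule a → b has support 2/4 and confidence (2/4)/(3/4) = 2/3, and it is a level two rule.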

All Association Rules software will produce some kind of listing of rules
(as in Table 1) and most packages will also provide summaries. Few appear to
offer substantial organisation and management capabilities. To emphasise how 
important this is, it is worth noting that 5807 rules of between 2 and 8 levels were 
found for the 34 variable data set considered in this paper, even though support 
had to be more than 60% and confidence higher than 80%. The MARC software 
produces an initial summary window giving the parameter values chosen, the 
number of rules found (in total and by level) and a listing of the individual 
variables with their frequency of occurrence and the numbers of rules in which 
they are involved as either assumption or conclusion. There is also a set of 
controls to enable flexible selections of variables for detailed analysis and the 
capability to sort on any of the numbers produced. Sorting is a much-underrated 
facility in statistical software. That the rules should be sorted on confidence and 
support is obvious, but sorting on variables is also informative for grouping rules 
with the same variables as assumptions or conclusions. 



3 Data Set Description 

Association rules are supposed to be ideal for analysing market basket data, 
that is customer shopping transactions. Which products are bought together is 
an important question for store management. These data sets have large numbers 
of products (just think of how many different products your local supermarket 
stocks) and almost all are bought relatively rarely. Either the method is success- 
ful, but the results have not been released for reasons of confidentiality, or there 
has been less practical application of this kind than theoretical development. 

Like most other researchers in this field we have therefore not analysed a
real market basket data set (there are several analyses of artificial data sets
in the literature) but have experimented with applying the method of association
rules to another kind of binary data. The data set contains information on
birds' breeding habits by geographic region [Buckland et al., 1990]. 395 regions
in North East Scotland were classified as breeding or non-breeding grounds for
each of 34 different kinds of bird. There is interest in which groups of birds breed
in the same regions. Do some birds share the same breeding grounds? Does the
breeding presence of certain birds imply that a further species breeds there too?
While this data set is not large and does not have a large number of variables,
it is a real data set. It is interesting to see what association rules might
suggest, and it also serves as an example to illustrate how large numbers of
rule results can be displayed and analysed.

One unexpected result of the analysis was the realisation that not only is
effective post-filtering essential to cut down the number of rules discovered to a
manageable group, but that pre-filtering is necessary as well. The high support
threshold chosen immediately excluded all birds that bred in fewer than 60% of
the regions from being in any of the rules, whether as assumption or conclusion.
In many ways it is the rarer birds that are more interesting. Setting the support
threshold much lower and then selecting the rules in which the rarer birds were
conclusions would always be possible, but huge numbers of rules of no interest
would be produced. It is more practical both to lower the threshold and to have
the option of excluding the more frequent birds from appearing as conclusions
in any rules.

4 What Is a TwoKey Plot? 

A TwoKey plot shows the two keys of confidence and support for all discovered
association rules. Its lower limits are determined by the threshold criteria
specified by the user. In Fig. 1 these were a minimum support of 60% and a
minimum confidence of 80%. The maximum values are, of course, potentially 100%
and this may be reached on the Y axis (confidence). It is more than a standard
scatterplot because it has extensive interactive features, and it is more than an
interactive scatterplot because it has tools specifically designed for association
rules. A TwoKey plot includes basic interactive features (see Unwin [1999]) such
as querying (to find out which rules are represented), zooming (to study a subset
of rules in more detail), selecting (to highlight rules of certain types) and linking
(to garner information from other displays). There are also density estimation
tools and line connections to display relationships between rules.

Fig. 1. A scatterplot of all 5807 rules found. No density estimation has been
used. In the colour version the points are coloured by the level of the rule and
this is partially seen in the shading

Fig. 2. MANET plot of confidence against support with pointsize = 2 and
maximum brightness at pixels with 12 or more points (x axis: support; y axis:
confidence)

Several features can be identified in Fig. 1. The darker points on the right
are all 2-level rules. They form straight lines because the LHS assumption is the
same in all cases. The reason for this is simple. For an association rule X → Y
the ratio of confidence to support is constant in X, i.e. for each LHS X,
there is a straight line, on which all data points X → Y appear with increasing
confidences (starting from the origin).
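
The straight-line property follows directly from the definitions in Sect. 2. As a short worked derivation, writing supp and conf for the support and confidence of X → Y as probabilities:

$$\mathrm{conf}(X \to Y) = P(Y \mid X) = \frac{P(X \cap Y)}{P(X)} = \frac{\mathrm{supp}(X \to Y)}{P(X)}$$

For a fixed LHS X, confidence is therefore a linear function of support with slope 1/P(X), so all rules sharing that LHS lie on one line through the origin, and a rarer LHS gives a steeper line.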

In Fig. 1 there are three blocks of rules, which have the appearance of clouds
moving to the left leaving exhaust trails behind them. Rules involving a partic- 
ular variable as conclusion will sometimes be in just one of these blocks. This 
arises when there is at least one good 2 level rule (i.e. good for that conclusion) 
and adding further assumptions lowers support but adds little to confidence. 

The colour coding of rule levels is very informative in an initial overview. Low- 
level rules tend to be to the right (high support) and lower (less confidence), but 
individual rules of different types then stand out, especially higher-level rules 
with more support. To get an impression of how many rules are located in
different areas of the plot, a density estimation version of the display is needed
(and therefore without the colour coding of levels). This cannot yet be carried out in
MARC, but is available in the general interactive graphics software MANET,
written by Heike Hofmann [Hofmann, 2000b]. Figure 2 shows the same plot as
Fig. 1, but in MANET and with a fast (though crude) density estimate. There are
two parameters controlling the density estimation: the size r of individual points
and an upper limit L on the density represented. In Fig. 2, r = 2 and L = 12.
Each pixel at which 12 or more points overlap is coloured bright white and lower
densities are shaded in a proportional fashion. (At many locations in the top left 
hand corner of the plot there are substantially more than 12 rules, as is readily 
found by direct querying of the locations). Although the method is simple, it is 
fast enough to be interactive so that changing r and L reveals different aspects 
of the density of rules. This is particularly useful when examining different parts 
of the plot. Parameter values suitable for high density areas leave lower density 
areas looking quite uniform and vice versa. 
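
The capped overlap count described above can be sketched as follows. This is not MANET's implementation: the function name, the drawing of each point as an r × r square of pixels anchored at its coordinates, and the linear brightness scale are assumptions made for illustration.

```python
def density_image(points, width, height, r=2, limit=12):
    """Per-pixel brightness in [0, 1]: overlap count capped at `limit`, scaled linearly."""
    counts = [[0] * width for _ in range(height)]
    for px, py in points:                       # non-negative integer pixel coordinates
        for y in range(py, min(py + r, height)):
            for x in range(px, min(px + r, width)):
                counts[y][x] += 1               # each point covers an r x r square
    # Pixels with `limit` or more overlapping points get maximum brightness 1.0.
    return [[min(c, limit) / limit for c in row] for row in counts]
```

Lowering `limit` brings out structure in sparse regions, while raising it separates the densest regions, which mirrors the trade-off between high and low density areas noted in the text.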

TwoKey plots can be used as standard interactive displays, where they may 
be linked to other graphic displays of the data: a barchart of level (for the same 
reason as the colour coding above), barcharts of the individual variables (to show 
in which groups of rules variables are involved) or mosaic plots [Hofmann, 2000a]
of combinations of the variables (to select rules with certain combinations or, 
more usefully, to show which combinations arise in a selected group of rules). 
Linking is a powerful two-way tool. The information it delivers depends on the 
diagram where you make the selection and which diagrams you link to. Graphical 
interaction provides an easily understandable interface to the results for smaller 
numbers of variables. For working with larger numbers, the control window in 
MARC is more suitable, where many variables can be selected simultaneously. 
It could be argued that detailed interactive approaches are most relevant when 
studying subsets of results after extensive filtering, but both kinds of selection 
facilities should be provided. Using subjective filtering (e.g. just display the 
rules with a particular conclusion or display all rules involving any of a subset 
of variables) is an effective exploratory tool, but remains subjective. Selection 
rules based on objective filtering criteria must still be added, but it is not yet 
clear which. There are several promising alternatives and they need to be put
into a common framework. We envisage a three-part process: 

1. Draw the TwoKey plot for all rules to gain a global overview. 

2. Apply one or more “objective” filtering rules to reduce the number of rules 
under consideration. 

3. Use subjective selection tools to uncover the most interesting rules amongst
the remainder.

This would not be a rigid three-stage process, but a continuing
exploration to identify rules and subgroups of rules that are worth following
up.

In an interactive environment of linked graphics (Becker et al 1987, Unwin 
et al 1996, Theus 1996, Wilhelm 2000), further information can be brought into 
consideration using selection & highlighting. 

In this data set, spatial information about the items is available. We can
extract two rules and have a closer look at them. The rules Mallard → Pheasant
(54.4, 81.4) and Blue Tit, Curlew, Great Finch, Oystercatcher, Wren → Pheasant
(50.1, 97.1) are shown on the corresponding map (see Fig. 3).





[Figure: two maps. Legend entries include "X and not Y" and "not X and not Y".
Right panel: Blue Tit, Curlew, Great Finch, Oystercatcher, Wren -> Pheasant]

Fig. 3. Maps of the bird watching area. Gray shading corresponds to different
combinations of presence/absence of left and right hand side of rules



Gray shading is used to encode the different combinations of left and right hand
side of the two rules. Areas in which the birds of both the left and the right hand
side of the rule have been observed are coloured middle gray; areas in which only
birds from the right hand side, but not from the left hand side, have been observed
are coloured black. The darkish gray represents all those areas where birds of the
right hand side are present, but not all birds of the left hand side have been
observed.

In this example you can see that the simple rule Mallard → Pheasant shows
some spatial relationship: several areas in the south-eastern part of the map are
falsely assigned to Pheasant (marked by the circled 1), whereas the rule misses
out some areas in the centre (marked by 2). This problem is overcome by the
more specific rule shown on the right of Fig. 3. The areas in the south-eastern
part are fitted nicely by it and the rule detects most of the areas in the centre,
too.

Using interactive graphics opens the way to find structural behaviour in 
association rules, e.g. separate spatial clusters or subgroups among the target 
population. 



5 Special Tools in TwoKey Plots for Analysing 
Association Rules 

5.1 Children and Descendants 

Given a typical simple two level rule X → Y, it is clearly worth investigating
what happens when more conditions are added to the left hand side. X → Y
could be an obvious relationship and we might be interested in seeing how it
might be extended. In MARC's TwoKey plot you can select any rules (not just
two level rules) and have lines drawn from each rule to its immediate children.
A child is defined as having one more condition on the left hand side. Figure 4
shows a plot of this kind with one three level rule selected. Naturally, only rule
children are shown which have sufficient confidence and support values to appear
in the plot. It is immediately clear which additional group of rules it might be
worth looking at if we start from the selected rule. All rule children must lie to
the left (they cannot have higher support) and only rules with higher confidence
will be worth considering. There are likely to be two kinds: a few rules which have
somewhat higher confidence, but only a little less support, and some rules which
have a lot less support, but rather more confidence. Which group is more important
depends very much on the client's goals and the particular rules.
We will have a closer look at this situation in Sect. 6. In Fig. 4 we can see that
only 3 children are worth considering and they have very different reductions in
support. Note that the display would also make clear if there were a number of
closely related alternatives or if one rule child was substantially better than the
others. It is possible in MARC to display descendants of 2 or more generations,
but the display rapidly becomes unclear. It is a topic of current research to see
how this might be improved.

Fig. 4. The children of a 3-level rule have been linked by lines to the rule
(x axis: support; y axis: confidence)

5.2 Parents and Ancestors 

Just as it can be informative to study rule descendants, it is helpful to be able 
to examine rule predecessors. For a single m level rule there can be at most (m - 
1) parents. Lines may of course be drawn in MARC from selected rules to their 
parents. Parents must lie to the right (higher support). A rule with a parent 
with higher confidence will not be a good rule. There is an argument for filtering 
out all such rules, but, initially at any rate, it seems appropriate to display all 
rules that pass the user-specified thresholds. 
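
The child and parent relations of Sects. 5.1 and 5.2 can be sketched as below. This is an illustration only, not MARC's code: a rule is represented as a pair `(lhs, rhs)` with `lhs` a frozenset of items, and only rules present in the given collection (i.e. those that passed the thresholds) are returned.

```python
def children(rule, rules):
    """Rules with the same RHS whose LHS adds exactly one condition."""
    lhs, rhs = rule
    return [(l, r) for (l, r) in rules
            if r == rhs and lhs < l and len(l) == len(lhs) + 1]


def parents(rule, rules):
    """Rules with the same RHS whose LHS drops exactly one condition."""
    lhs, rhs = rule
    return [(l, r) for (l, r) in rules
            if r == rhs and l < lhs and len(l) == len(lhs) - 1]
```

Since a level m rule has m - 1 conditions on its LHS, dropping one condition at a time yields at most m - 1 parents, in line with the bound stated above.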

5.3 Neighbours 

Individual association rules are rather blinkered. They tell you nothing about 
closely related rules. Relatives cover one form of closeness, but the term neigh- 
bour is used in MARC to cover all others. 

A rule R2: X' → Y' is a neighbour of distance d of another rule R1: X → Y
if one of the following holds (where a step is an addition or a deletion):

(a) the RHS is the same (Y' = Y) and the LHSs are d steps apart;

(b) the RHS is different and the LHSs are d - 2 steps apart.

With this definition the only immediate neighbours (d = 1) are parents and 
children. Rules with only the RHS different are neighbours of distance 2. 
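
Under this definition the distance between two rules can be computed directly. The sketch below assumes rules are given as `(lhs, rhs)` pairs with a single-item RHS; the function name is hypothetical.

```python
def neighbour_distance(rule1, rule2):
    """Distance d such that rule2 is a neighbour of distance d of rule1."""
    (lhs1, rhs1), (lhs2, rhs2) = rule1, rule2
    # Each item in the symmetric difference costs one step (an addition or a deletion).
    steps = len(set(lhs1) ^ set(lhs2))
    # A different RHS counts as two further steps, as in case (b).
    return steps if rhs1 == rhs2 else steps + 2
```

Replacing one assumption by another costs one deletion plus one addition, so such rules are at distance 2, as are grandparents and grandchildren.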

Linking neighbours of a selected rule by lines in MARC helps in various ways. 
For instance, for distance d = 2, the rules on a straight line are those with the 
same assumptions but different conclusions. The other rules of the same colour 
have the same conclusion but one different assumption. The remaining rules 
of different colours have the same conclusion but either two fewer assumptions 
(grandparents) or two more (grandchildren). 

MARC produces tables of results for the selected subgroup of rules to provide 
a textual overview to complement the graphic display. Further developments 
will include the incorporation of statistical tests in these tables to identify rules
that are "significantly" different from the originally selected rules, and this may
also be colour-coded on the lines in the graphic display. Significance would only
be used as a rough guideline. The interdependence of the tests and the multiple 
testing involved precludes anything more, but it is valuable to attempt to assess 
relationships in a formal statistical way. 
6 Relationships among Rules

Selecting rules from the set of results is a crucial issue. With the help of
interactive selection methods and linking we have many more choices of acceptance
criteria, all of which may be sensible against the background of a specific
application.

Of particular interest is how we may compare rules of the form X1 → Y
and X2 → Y, where the first rule has higher confidence, but less support than
the second rule. Is there a way to tell which of these rules is "better" than the
other?

In Fig. 5, a line is drawn between each pair of rules that fulfil an ancestor
relationship.




Fig. 5. Lines between each ancestor/successor pair of rules (x axis: supp)

Fig. 6. Slope sl between two related rules vs. their test statistic T (x axis:
factor)



For these lines, the angle between a rule and its successor is of interest: it is
clear that an angle < 0° connects a rule to a successor with less confidence,
which marks the successor rule as a total failure. Lines with a very
steep angle, on the other hand, show those successors that have only a little less
support than their ancestor but more confidence.

Can the angles of these lines therefore be exploited for a visual pruning
mechanism? For this, we look at the (negative) slope sl between X → Z (s1, c1)
and its successor X ∩ Y → Z (s2, c2):

$$sl = \tan(\alpha) = \frac{c_2 - c_1}{s_1 - s_2}$$

This expression for the slope between two related rules recalls another test
statistic, which is used widely for pruning rules based on their ancestor
relationship structure [Bing Liu et al., 1999]: Let T be the test statistic of
P(Z | X, Y) vs P(Z | X, ¬Y):

$$T = \sqrt{n\,P(X)} \cdot \frac{P(Y \cap Z \mid X) - P(Y \mid X)\,P(Z \mid X)}{\sqrt{P(Z \mid X)\,P(\neg Z \mid X) \cdot P(Y \mid X)\,P(\neg Y \mid X)}}$$

T is approximately normally distributed (it is the square root of a test of
conditional independence of Y and Z given X). Prominent features in Fig. 6 are
the strong linear relationship between the lower values of the statistics and the
cloud of points with high slope. Within this cloud of points, T and sl take
very different values. The two rules mined from the SAS Assoc Data, herring &
baguette → heineken and chicken & coke & sardines → heineken, have similar
values of T, but different slope values:

rule                                    ancestor                     T       sl
herring & baguette → heineken           baguette → heineken          10.72   4.13
chicken & coke & sardines → heineken    chicken & coke → heineken    10.82   71.41

In order to explain the differences in T and slope, we rewrite the slope as:

$$sl = \left[ P(Y \cap Z \mid X) - P(Y \mid X)\,P(Z \mid X) \right] \cdot \frac{1}{P(Y \mid X)\,P(X \cap Z \cap \neg Y)}$$

The first term of this product gives a measure for the conditional independence
of Y and Z given X and is well known from the test statistic T. The second
term is more tricky: P(Y | X) will be large if P(X ∩ ¬Y) is small; this measures
the loss in support caused by adding the item Y to X. P(X ∩ Z ∩ ¬Y) is linear
in the number of result hits that are thrown away by adding Y.
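
The two quantities can be compared numerically with a small sketch. The function and parameter names are assumptions introduced here; all inputs are probabilities (not percentages), with `p_y_x` = P(Y | X), `p_z_x` = P(Z | X) and `p_yz_x` = P(Y ∩ Z | X).

```python
from math import sqrt


def slope(s1, c1, s2, c2):
    """(Negative) slope sl between X -> Z at (s1, c1) and its successor at (s2, c2)."""
    return (c2 - c1) / (s1 - s2)


def t_statistic(n, p_x, p_y_x, p_z_x, p_yz_x):
    """Test statistic T for conditional independence of Y and Z given X."""
    num = p_yz_x - p_y_x * p_z_x
    den = sqrt(p_z_x * (1 - p_z_x) * p_y_x * (1 - p_y_x))
    return sqrt(n * p_x) * num / den


def slope_from_probabilities(p_x, p_y_x, p_z_x, p_yz_x):
    """The rewritten form of sl: independence term times the second factor."""
    num = p_yz_x - p_y_x * p_z_x
    p_xz_not_y = p_x * (p_z_x - p_yz_x)      # P(X and Z and not Y)
    return num / (p_y_x * p_xz_not_y)
```

With P(X) = 0.5, P(Y | X) = 0.5, P(Z | X) = 0.6 and P(Y ∩ Z | X) = 0.4, the ancestor lies at (s1, c1) = (0.3, 0.6) and the successor at (s2, c2) = (0.2, 0.8); both slope functions then give sl = 2, which checks that the rewritten form agrees with the definition.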

Some properties of the slope:

- According to the value of the slope, that rule X ∩ Y → Z is chosen which
  cuts off small slices from the support while gaining maximum confidence.

- It ignores the values of the ancestor rule X → Z.

- It is not additive:

  $$sl_3 < sl_1 + sl_2,$$

  where sl_1 is the slope between rules X → Z and → Z, sl_2 is the slope
  between rules X ∩ Y → Z and X → Z, and sl_3 is the slope between rules
  X ∩ Y → Z and → Z.

For each ancestor rule, the results of sl and T show a strong linear relationship
(cf. Fig. 7).

Conclusion: for single ancestor rules the results from the test statistic and the
slope are approximately linearly dependent. The angles between the ancestor rule
and its successors may therefore be used as a measure for the strength of a
successor rule. This leads to the following criterion: A rule $X \cap Y_1 \to Z$ is
dominated by a rule $X \cap Y_2 \to Z$

$$\iff sl(X \cap Y_1 \to Z) < sl(X \cap Y_2 \to Z)$$

$$\iff \frac{P(Z \mid X, Y_2) - P(Z \mid X)}{P(Z \cap X) - P(Z \cap X \cap Y_2)} > \frac{P(Z \mid X, Y_1) - P(Z \mid X)}{P(Z \cap X) - P(Z \cap X \cap Y_1)}.$$

Fig. 7. Linear dependencies between sl and T for fixed ancestors
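
The dominance criterion stated before Fig. 7 can be sketched in a few lines. The function names and the (support, confidence) values below are hypothetical; the sketch assumes both successor rules share the same ancestor and that each has strictly smaller support than the ancestor.

```python
def slope(conf_ancestor, supp_ancestor, conf_successor, supp_successor):
    """Slope between an ancestor X -> Z and a successor X and Y -> Z."""
    return (conf_successor - conf_ancestor) / (supp_ancestor - supp_successor)

def is_dominated(rule_1, rule_2, ancestor):
    """True if rule_1 is dominated by rule_2.

    Every rule is given as a (support, confidence) pair; both successor
    rules are assumed to have the same ancestor.
    """
    s0, c0 = ancestor
    sl_1 = slope(c0, s0, rule_1[1], rule_1[0])
    sl_2 = slope(c0, s0, rule_2[1], rule_2[0])
    return sl_1 < sl_2

# Illustrative values (assumed): rule_2 gains more confidence per unit
# of lost support than rule_1, so rule_1 is dominated.
ancestor = (0.06, 0.30)
rule_1 = (0.02, 0.50)
rule_2 = (0.03, 0.60)
dominated = is_dominated(rule_1, rule_2, ancestor)
```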



7 Summary 

The concept of TwoKey Plots is a very simple but very effective one. It has 
been introduced as a means of providing a global overview of large amounts of 
association rules discovered in an analysis and of providing the tools to explore 
and evaluate them. The range and flexibility of the resulting graphic displays 
make this a very attractive method for working with association rules. 

Data Mining methods can produce many different outputs and these need 
to be organised, managed and filtered. Pruning or filtering methods are an im- 
portant first step, but they leave many rules still to be assessed. Exploiting 
the ancestor relationship structure of rules gives a visual approach to filter out 
potentially interesting rules. Here, we have seen that a visual analysis of the
standard ancestor-successor relationship coincides with a $\chi^2$-test of independence as
proposed by Bing Liu et al. (1999).

The effective application of any method of analysis lies in the combination 
of objective results and subjective knowledge. The TwoKey plot is a step in this 
direction for association rules. Its particular strength lies in interactivity, so that 
users can incorporate both their background knowledge and their foreground 
goals in evaluating the rules that have been discovered. 

Software. The MARC (Multiple Association Rules Control) software used for 
the work in this paper is being developed by the authors at Augsburg. Klaus 
Bernt is responsible for the system design and programming. MARC is written 
in Java and it is planned to make the software available to other researchers 
in the near future. (Check our website wwwl.math.uni-augsburg.de for further
details.) Like all Augsburg software, MARC is named after a painter close to the
Impressionists. Franz Marc was born in Munich, near to Augsburg, and was an 
influential member of the Blaue Reiter group.

Abstract. A lightweight method for collaborative filtering is described
that processes binary encoded data. Examples of transactions that can 
be described in this manner are items purchased by customers or web 
pages visited by individuals. As with all collaborative filtering, the objec- 
tive is to match a person’s records to customers with similar records. For 
example, based on prior purchases of a customer, one might recommend 
new items for purchase by examining stored records of other customers 
who made similar purchases. Because the data are binary (true-or-false) 
encoded, and not ranked preferences on a numerical scale, efficient and 
lightweight schemes are described for compactly storing data, computing 
similarities between new and stored records, and making recommenda- 
tions tailored to an individual. 



1 Introduction 

Recommendation systems provide a type of customization that has become pop- 
ular on the internet. Most search engines use them to group relevant documents. 
Some newspapers allow news customization. E-commerce sites recommend pur- 
chases based on preferences of their other customers. The main advantages of 
recommendation systems stem from ostensibly better targeted promotions. This 
promises higher sales, more advertising revenues, less search by customers to get 
what they want, and greater customer loyalty. 

Collaborative filtering pQ is one class of recommendation systems that mimics
word-of-mouth recommendations. A related task is to compare two people
and assess how closely they resemble one another. The general concept of
nearest-neighbor methods, matching a new instance to similar stored instances,
is well-known [2]. Collaborative filtering methods use this fundamental concept,
but differ in how data are encoded, how similarity is computed, and how
recommendations are computed.

We describe a lightweight method for collaborative filtering that processes 
binary-encoded data. Examples of transactions that can be described in this 
manner are items purchased by customers or web pages visited by individuals. 
As with all collaborative filtering, the objective is to match a person’s records 
to customers with similar records. For example, based on prior purchases of a 
customer, one might recommend new items for purchase by examining stored 



L. De Raedt and A. Siebes (Eds.): PKDD 2001, LNAI 2168, pp. 484-^^^ 2001. 
© Springer-Verlag Berlin Heidelberg 2001



Lightweight Collaborative Filtering Method for Binary-Encoded Data



records of other customers who made similar purchases. Because the data are 
binary (true-or-false) encoded, and not ranked preferences on a numerical scale, 
efficient and lightweight schemes are described for compactly storing data, com- 
puting similarities between new and stored records, and making recommenda- 
tions tailored to an individual. Our preliminary results are promising and com- 
petitive with published benchmarks. 



2 Methods 

The collaborative filtering problem we looked at is a generic task that occurs in 
many applications. A specific example is recommending pages to a user surfing 
the web. We shall describe our algorithm within the context of this application 
where the attributes are the pages visited by users (and so we shall use “page” 
and “attribute” interchangeably). The value-attribute for a user simply records
whether or not the user visited the corresponding page. Hence the data we are 
concerned with is purely binary. 

In the simplest scoring scheme, recommendations might be made based on 
a linear weighted combination of other people’s preferences. Most popular are 
memory-based methods. Complex model-based approaches such as Bayesian
networks have also been explored [2]. We examined variations of a far simpler
scheme. The basic idea is as follows:

1. find the k nearest neighbors to the new (test) case 

2. collect all attributes of these neighbors that don’t occur in the test case 

3. rank these attributes by frequency of occurrence among the k neighbors. 

In measuring distance between cases, we compute a score that measures
similarity: the higher the score, the greater the similarity. For each training case,
we count the number of positive attributes in common with the test case. We add
a small bonus for each shared page: the reciprocal of the number of training cases
in which this page appears. This ensures that rare pages get a higher bonus
than “popular” pages. As an example, suppose a new example has visited 5
webpages. We look at the stored examples to find similar examples. The most
similar examples would match in all 5 webpages. Their scores would be 5 plus
the value of the pre-computed bonuses for each of those pages. The score
ignores any negative distance that could be computed from pages visited in the
stored examples but not visited by the new example.
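
The scoring rule just described can be sketched as follows. This is a minimal illustration, assuming each case is represented as a Python set of visited page ids; the names `page_bonus` and `similarity` and the toy data are hypothetical.

```python
from collections import Counter

def page_bonus(training_cases):
    """Bonus per page: reciprocal of the number of training cases containing it."""
    df = Counter(page for case in training_cases for page in set(case))
    return {page: 1.0 / count for page, count in df.items()}

def similarity(new_case, stored_case, bonus):
    """Count shared pages, adding the pre-computed bonus of each shared page.

    Pages visited only by the stored case are ignored, as in the text.
    The stored case is assumed to be one of the training cases, so every
    shared page has a bonus entry.
    """
    shared = set(new_case) & set(stored_case)
    return sum(1.0 + bonus[page] for page in shared)

# Toy data (assumed): page 3 is popular, page 1 is rare.
training = [{1, 2, 3}, {2, 3}, {3, 4}]
bonus = page_bonus(training)
score = similarity({2, 3, 5}, training[0], bonus)
```

Here the stored case {1, 2, 3} shares pages 2 and 3 with the new case, so its score is 2 plus the bonuses 1/2 and 1/3.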

Further improvements might be obtained by modifying step 2 slightly and 
splitting an example’s vote. Instead of considering all attributes of all neighbors 
equally, we instead assign 1 vote for each neighbor, and split that vote among 
its attributes. The intuition behind this is that if a neighbor makes only one rec- 
ommendation, it is more important than a recommendation made by a neighbor 
that makes 10 recommendations. This affects the frequency of occurrence and 
alters the ranking of the recommended attributes in the ranked list. 
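
A small sketch of how vote splitting can change the ranking, assuming cases are sets of page ids and the neighbors have already been selected. The function name and the toy neighbors are hypothetical; following the algorithm figure, the single vote of a neighbor is divided by the number of its positive attributes.

```python
from collections import defaultdict

def rank_attributes(new_case, neighbors, split_votes=False):
    """Rank pages that occur in the neighbors but not in the new case."""
    new_case = set(new_case)
    rank = defaultdict(float)
    for neighbor in neighbors:
        if not neighbor:
            continue  # an empty neighbor has nothing to recommend
        # Basic scheme: one vote per attribute; vote splitting: one vote per neighbor.
        vote = 1.0 / len(neighbor) if split_votes else 1.0
        for page in set(neighbor) - new_case:
            rank[page] += vote
    # Highest rank first; ties broken by page id for reproducibility.
    return sorted(rank, key=lambda page: (-rank[page], page))

# Toy neighbors (assumed): the first one makes a single recommendation.
neighbors = [{1, 7}, {1, 3, 4, 5, 6}, {1, 3, 8, 9, 10}]
basic = rank_attributes({1}, neighbors)
split = rank_attributes({1}, neighbors, split_votes=True)
```

In this toy case page 3 leads the basic ranking because two neighbors contain it, while page 7 leads under vote splitting because it comes from a neighbor with only two attributes.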

Another change we consider is to also measure the degree of similarity among
neighbors' scores. Attributes of closer neighbors might be assigned a higher
weight than those far away. We do this by using a weight of 1 for the closest
neighbor(s), and a proportionally smaller weight for neighbors further away.
The weight is the ratio of the score of a case to the highest score.
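
The relative similarity weight is a one-line computation. The sketch below assumes a list of strictly positive neighbor scores; the function name and the scores are hypothetical.

```python
def relative_similarity_weights(scores):
    """Weight of each neighbor: its score divided by the highest score."""
    best = max(scores)
    return [score / best for score in scores]

# Assumed scores of three neighbors, closest first.
weights = relative_similarity_weights([4.5, 3.0, 1.5])
```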

The overall algorithm is shown in Fig. 1. It follows the 3 steps listed earlier.
Scoring a case involves computing a function pv(j) for each attribute. This
function measures the a priori predictive value of the attribute and is computed once
at the start. Note the two modifications to the basic algorithm corresponding to
vote splitting and relative similarity.

In order to compute the scores and tally the frequencies efficiently, the sample
of cases is stored as follows:

- Case List. Here the cases are stored sequentially as a series of numbers
  corresponding to the positive attributes in the case. All cases are stored in a
  single vector, with another vector pointing to the start of each case. Table 1
  shows an example of how the cases are implemented using two lists. The
  first case consists of positive attributes 2 and 5. The second case consists of
  positive attributes 1, 7, 21 and 43. The third case begins with the attribute
  2.



Table 1. Example of case list implementation

case vector    2  5  1  7  21  43  2  ...
start vector   1  3  7  ...



- Inverse List. Here we record a series of numbers corresponding to the
  cases in which a specific positive attribute occurs. All attribute mappings to
  examples can be stored sequentially in a single vector, with another vector
  pointing to the start of each attribute. Table 2 shows an example of how this
  is done. The first attribute occurs in cases 2 and 50; the second attribute
  occurs in cases 1, 3 and 45.



Table 2. Example of inverse list implementation

inverse vector           2  50  1  3  45  ...
attribute start vector   1  3  6  ...
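
The two-vector layout of the case list and the inverse list can be sketched as below. This is an illustration under assumptions: function names are hypothetical, positions and case ids are 0-based (the tables use 1-based start positions), and a final sentinel entry is appended to each start vector so that every slice has an end.

```python
def build_case_list(cases):
    """Flatten cases into one vector plus a vector of start positions."""
    case_vector, start_vector = [], []
    for attributes in cases:
        start_vector.append(len(case_vector))
        case_vector.extend(attributes)
    start_vector.append(len(case_vector))  # sentinel: end of the last case
    return case_vector, start_vector

def build_inverse_list(cases, num_attributes):
    """For every attribute id, list the cases in which it is positive."""
    occurrences = [[] for _ in range(num_attributes)]
    for case_id, attributes in enumerate(cases):
        for attribute in attributes:
            occurrences[attribute].append(case_id)
    # The inverse list uses the same two-vector layout as the case list.
    return build_case_list(occurrences)

# The first two cases of Table 1, plus an assumed one-attribute third case.
cases = [[2, 5], [1, 7, 21, 43], [2]]
case_vector, start_vector = build_case_list(cases)
inverse_vector, attribute_start = build_inverse_list(cases, num_attributes=44)

# Cases containing attribute 2: slice between consecutive start positions.
cases_with_2 = inverse_vector[attribute_start[2]:attribute_start[3]]
```

The start positions 0, 2, 6 produced here correspond to the 1-based entries 1, 3, 7 of Table 1.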



The case list is used to compute the frequencies; the inverse list is used to
compute the scores. Both lists are computed once, at the start. Following
that, the scores and frequencies can be computed very efficiently.

Input:  C {new case represented by m attributes C(1), ..., C(m)},
        D {historical data of n cases, D1 ... Dn}
Output: A {ranked list of attributes}

Begin
  For j = 1 to m do
    df = number of cases in D where attribute j appears;
    pv(j) = 1 + 1/df;
  done

  score(Di) = 0 for i = 1,n
  rank(j) = 0 for j = 1,m

  For j = 1 to m do
    If (C(j) == 0) continue;
    // examine only attributes that are positive for the new case
    For i = 1 to n do
      If (Di(j) == 0) continue;
      // score a case only if it shares an attribute with new case
      score(Di) += pv(j);
    done
  done

  T = select k cases with highest scores in D;

  For j = 1 to m do
    If (C(j) == 1) continue;
    // examine only attributes that are NOT positive for new case
    For i = 1 to k do
      // increase count of those attributes that are in top-k cases
      If (Ti(j) == 1)
        rank(j) += 1;
        // MODIFICATION 1 (vote splitting): rank(j) += 1/sum(Ti);
        //   where sum(case) is the number of positive attributes in case
        // MODIFICATION 2 (relative similarity): rank(j) += (1/sum(Ti)) * xscore(Ti)
        //   where xscore(Ti) = score(Ti)/score(T1)
    done
  done

  Output = small subset of attributes with highest rank(j);
End



Fig. 1. The lightweight collaborative filtering algorithm
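
The pseudocode of Fig. 1 can be turned into a short runnable sketch. The version below is one possible reading, not the authors' implementation: cases are assumed to be sets of page ids, the name `recommend` and its parameters `top_n` and `variant` are hypothetical, and, unlike the figure, only training cases that share at least one page with the new case are considered as neighbors.

```python
from collections import Counter, defaultdict

def recommend(new_case, training_cases, k, top_n=10, variant="basic"):
    """Return up to top_n recommended pages for new_case.

    variant: "basic", "split" (vote splitting) or
             "split+similarity" (vote splitting and relative similarity).
    """
    new_case = set(new_case)

    # pv(j) = 1 + 1/df(j), computed once from the training data.
    df = Counter(page for case in training_cases for page in set(case))
    pv = {page: 1.0 + 1.0 / count for page, count in df.items()}

    # Inverse list: page -> indices of the training cases containing it.
    inverse = defaultdict(list)
    for index, case in enumerate(training_cases):
        for page in set(case):
            inverse[page].append(index)

    # Score only the cases that share a page with the new case.
    scores = defaultdict(float)
    for page in new_case:
        for index in inverse.get(page, ()):
            scores[index] += pv[page]

    # The k highest-scoring cases; ties broken by case index.
    top = sorted(scores, key=lambda index: (-scores[index], index))[:k]
    if not top:
        return []
    best_score = scores[top[0]]

    # Tally the pages of the neighbors that the new case has not visited.
    rank = defaultdict(float)
    for index in top:
        case = set(training_cases[index])
        if variant == "basic":
            vote = 1.0
        elif variant == "split":
            vote = 1.0 / len(case)
        elif variant == "split+similarity":
            vote = (1.0 / len(case)) * (scores[index] / best_score)
        else:
            raise ValueError("unknown variant: " + variant)
        for page in case - new_case:
            rank[page] += vote

    return sorted(rank, key=lambda page: (-rank[page], page))[:top_n]

# Toy training data (assumed, not the msweb data).
training = [{1, 7}, {1, 3, 4, 5, 6}, {1, 3, 8, 9, 10}, {2, 11}]
basic = recommend({1}, training, k=3)
split = recommend({1}, training, k=3, top_n=2, variant="split")
```

A dictionary of Python lists stands in here for the compact inverse vector of Table 2; the two-vector layout would serve the same purpose with less memory overhead.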

3 Results



Our work is a followup to [2], which reports on empirical results using Bayesian
networks and memory-based methods for three datasets. Our experiments were
performed on the msweb dataset.

The problem we looked at is a generic task that occurs in many applications:
recommending pages to a user surfing the web. The msweb data of users visiting
pages at the Microsoft website can be viewed in this manner. The data is
pre-divided into 32711 training cases and 5000 test cases, with 298 attributes. A case
can be seen as a user and an attribute can be seen as a webpage visited by the
user. The authors of [2] report a variety of experiments in which some attributes in the test
cases are predicted by models based on the training data. One of the scenarios
is as follows: in each test case, a visited page is randomly selected and “hidden”.
Based on the other attributes, the models learned from the training data attempt
to recommend pages to visit. The models are evaluated by assessing how well
they do in predicting the “hidden” page. Since our experiments require that test
cases have 2 or more visited pages, the test cases used are a subset of the full
set: 3453 cases.

Since the models typically make a ranked list of recommendations, the authors of [2] use
an R-metric to measure the quality of the recommendation. The R-metric is
specified in Equations (1) and (2). Here, $R_a$ is the expected utility measuring how
likely it is that the user will visit an item on the ranked list, with an exponential
decay in likelihood for successive items, and is based on the user $a$'s votes $v_{a,j}$ on
item $j$. $R_a^{\max}$ is the maximum achievable utility and normalizing with it helps us
consider results independent of sample sizes and the number of recommendations
made. The higher the rank of the “hidden” page in the list of recommendations,
the higher the value of the R-metric.

$$R_a = \sum_j \frac{\max(v_{a,j} - d, 0)}{2^{(j-1)/(\alpha-1)}} \qquad (1)$$

$$R = 100 \, \frac{\sum_a R_a}{\sum_a R_a^{\max}} \qquad (2)$$

where $j$ is the position in the ranked list, $d$ is the neutral vote, and $\alpha$ is the
half-life of the decay.
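
For the all-but-one scenario with binary votes, the R-metric reduces to a simple function of the position of the “hidden” page. The sketch below makes several assumptions that are not stated in the text: votes are 0 or 1 with neutral vote d = 0, each test case has exactly one hidden page (so the maximum achievable utility per case is 1), and the half-life defaults to 5. The function name is hypothetical.

```python
def r_metric(hidden_ranks, half_life=5):
    """R-metric for test cases with a single hidden page each.

    hidden_ranks: 1-based position of the hidden page in each ranked
                  list, or None if the page was not recommended.
    """
    total = 0.0
    for rank in hidden_ranks:
        if rank is not None:
            # Utility halves every (half_life - 1) positions down the list.
            total += 1.0 / 2 ** ((rank - 1) / (half_life - 1))
    # With one hidden page per case, the maximum utility of each case is 1.
    return 100.0 * total / len(hidden_ranks)

# Hidden page ranked first in one case and missing in another (assumed data).
value = r_metric([1, None])
```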



Table 3 summarizes the results obtained for several model-based and
memory-based methods. The static ranked list simply recommends based on
popularity, without considering the known votes. As such it serves as a baseline.
Bayesian Clustering and Bayesian Networks are relatively complex Bayesian 
models 0. The Correlation method [Sj is further enhanced by the use of in- 
verse user frequency, default voting and case amplification. Vector similarity jS] 
are enhanced by the use of inverse user frequency transformations as well. The 
three versions of the lightweight algorithm differ in the rank computation as
described in Fig. 1. Except for the two Bayesian methods, all the other methods
are memory-based methods.

The results in Table 3 are for the best variation of each method. Clearly
the parameters of each method need to be optimized carefully. The performance
of the lightweight method depends on k, the number of neighbors used. Fig. 2
shows the result of varying k for each of the three variations. For a vote-splitting
strategy, higher values of k are helpful. For the basic algorithm, the performance
gains beyond a small value of k are marginal.



Table 3. Summary of results on MSWEB data

Collaborative Filtering Method                            R-metric
Static Ranked List                                        49.77
Bayesian Clustering                                       59.42
Vector Similarity                                         61.70
Correlation                                               63.59
Bayesian Network                                          66.69
Basic Lightweight                                         61.76
Lightweight with vote splitting                           64.35
Lightweight with vote splitting and relative similarity   64.60



Table 4. Top x recommendations

x          1       5       10
accuracy   .3165   .7387   .8412



[Plot of performance against k neighbors; k-axis ticks: 5, 25, 50, 100, 150, 200, 400]

Fig. 2. Effect of k on performance of the lightweight method



4 Discussion

The experiments suggest that our lightweight collaborative filtering method 
can be competitive with published results for far more complicated models. A 
lightweight algorithm, analogous to our basic collaborative filtering algorithm, 
has been extensively tested for information retrieval and document matching 
0. An earlier study showed that in an IR environment, where more than one 
recommendation is made, simpler algorithms can be surprisingly effective jS]. 

We relied on published results for the alternative, highly specialized col- 
laborative filtering methods. Our method is restricted to applications that are 
represented in binary form. Most real-world collaborative filtering methods ex- 
pect ranked data, where a user is asked to rate products, for example on a scale 
of 1 to 5. Far fewer benchmark datasets for binary data are publicly available. In 
the future, we can expect interest in binary representations to increase because
greater automation is achieved without users manually assigning ratings.

The paucity of publicly available datasets with binary encodings limited our 
evaluation to one well-known dataset. Clearly, this is a weakness of the
evaluation of our method. In the original study [2], several scenarios were examined
including fixing the number of given items to 2, 5 or 10, each simulating differ- 
ent levels of information. Still, these variations are all taken from the same data, 
and the simulations create an unnatural set of data, all with the same number 
of positive features. We chose to use the one testing variation, all-but-one, that 
encompasses the variable scenarios of the unmodified snapshot of users. 

Because memory-based methods, like our lightweight method, have
complexity O(n), where n is the number of examples, many researchers have shifted
attention from instance-based techniques to model-based techniques 0 ^01 • We 
are not presenting our method as the best or as the most run-time efficient under 
all circumstances. Belief and dependency networks have many desirable proper- 
ties, along with issues of complex representations and training. Our lightweight 
method is trivial to implement, mostly processes vectors sequentially, and oper- 
ates in an environment where almost every attribute is measured as zero. These 
conditions lend themselves to relatively efficient implementations until the num- 
ber of instances grows very large. Even then, experimentation may show that 
predictive performance reaches a plateau at a relatively small sample size of N 
cases. For example, on this dataset, much more data were originally available, 
but experimental results did not improve much with more data j0| . The simplic- 
ity of the lightweight method along with its binary vector representation, and its 
application without formal training to derive an intermediate model, may make 
it suitable for some recommendation systems, including those with very large
numbers of attributes. 

Collaborative filtering matches an individual’s history to users with similar 
records. An alternative approach to recommendations is item-based [10]. There, 
the top matches of each item are independently precomputed and stored (as in 
book retailers like Amazon), and only this information is used for tailoring a 
recommendation. Such an approach is efficient, has shown some good results, 
but has less potential for personalized recommendations.

Many collaborative filtering algorithms, including our lightweight algorithm,
ignore the sequence order of user-actions. Alternative algorithms, such as Markov 
models, can capture such information, which potentially can yield improved per- 
formance m However, such models can become extremely complex, especially 
in higher dimensions, where either a long sequence of actions is traced or the 
space of recommendations is very large. 

The R-metric is effective in measuring performance over a ranked list of 
recommendations. It rewards those recommendations that are correct and highly 
ranked, and it penalizes those recommendations that appear at the bottom of 
the list. We used the R-metric to facilitate comparisons to published results. 
Other metrics can also be examined. We measured the accuracy of finding the
“hidden” page among the top x recommendations. This is shown in Table 4. An
advantage of this metric is that by examining the trend for different values of x,
one can get a good idea of the quality of the solution.
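
The top-x accuracy is straightforward to compute from the ranked lists. A minimal sketch, assuming one ranked list and one hidden page per test case; the function name and the toy data are hypothetical.

```python
def top_x_accuracy(recommendations, hidden_pages, x):
    """Fraction of test cases whose hidden page is among the top x recommendations."""
    hits = sum(
        1
        for ranked, hidden in zip(recommendations, hidden_pages)
        if hidden in ranked[:x]
    )
    return hits / len(hidden_pages)

# Three test cases with assumed ranked lists and hidden pages.
recommendations = [[3, 7, 9], [4, 5, 6], [1, 2, 3]]
hidden_pages = [7, 6, 8]
trend = [top_x_accuracy(recommendations, hidden_pages, x) for x in (1, 2, 3)]
```

Evaluating the function for increasing x gives the kind of trend reported in the table of top x recommendations.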

Like search engines, the lightweight collaborative filtering algorithm uses an 
inverted list for efficient processing of sparse data. It requires very little data 
preparation and has a compact codebase. As such, it may prove highly desirable 
for applications that naturally fit a binary-encoded data representation.

Abstract. Support vector machines introduced three important
innovations to machine learning research: (a) the application of
mathematical programming algorithms to solve optimization problems in machine
learning, (b) the control of overfitting by maximizing the margin, and (c) 
the use of Mercer kernels to convert linear separators into non-linear de- 
cision boundaries in implicit spaces. Despite their attractiveness in classi- 
fication and regression, support vector methods have not been applied to 
the problem of value function approximation in reinforcement learning. 
This paper presents three ways of combining linear programming with 
kernel methods to find value function approximations for reinforcement 
learning. One formulation is based on the standard approach to SVM re- 
gression; the second is based on the Bellman equation; and the third seeks 
only to ensure that good actions have an advantage over bad actions. All 
formulations attempt to minimize the norm of the weight vector while 
fitting the data, which corresponds to maximizing the margin in standard 
SVM classification. Experiments in a difficult, synthetic maze problem 
show that all three formulations give excellent performance. However, 
the third formulation is much more efficient to train and also converges 
more reliably. Unlike policy gradient and temporal difference methods, 
the kernel methods described here can easily adjust the complexity of 
the function approximator to fit the complexity of the value function.

Abstract. Data mining research has approached the problems of
analyzing large data sets in two ways. Simplifying a lot, the approaches can
be characterized as follows. The database approach has concentrated on 
figuring out what types of summaries can be computed fast, and then 
finding ways of using those summaries. The model-based approach has 
focused on first finding useful model classes and then fast ways of fitting 
those models. In this talk I discuss some examples of both and describe 
some recent developments which try to combine the two approaches.

Abstract. Many graphics are used for decoration rather than for
conveying information. Some purport to display information, but provide
insufficient supporting evidence. Others are so laden with information that
it is hard to see either the wood or the trees. Analysing large data sets 
is difficult and requires technically efficient procedures and statistically 
sound methods to generate informative visualisations. Results from big 
data sets are statistics and they should be statistically justified. Graph- 
ics on their own are indicative, but not substantive. They should inform 
and neither confuse nor mystify. 

This paper will NOT introduce any new innovative graphics, but will 
discuss the statistification of graphics - why and how statistical con- 
tent should be added to graphic displays of large data sets. (There will, 
however, be illustrations of the Ugly, the Bad and the possibly Good.)

Abstract. This paper reports on a long-term inter-disciplinary research
project that aims at analysing the complex phenomenon of expressive 
music performance with machine learning and data mining methods. The 
goals and general research framework of the project are briefly explained, 
and then a number of challenges to machine learning (and also to com- 
putational music analysis) are discussed that arise from the complexity 
and multi- dimensionality of the musical phenomenon being studied. We 
also briefly report on first experiments that address some of these issues. 



1 Introduction 

This paper presents a long-term inter-disciplinary research project situated at 
the intersection of musicology and Artificial Intelligence. The goal is to develop 
and use machine learning and data mining methods to study the complex phe- 
nomenon of expressive music performance (or musical expression, for short). For- 
mulating formal, quantitative models of expressive performance is one of the big 
open research problems in contemporary (empirical and cognitive) musicology. 
Our project develops a new direction in this field: we use inductive learning tech- 
niques to discover general and valid expression principles from (large amounts 
of) real performance data. The project, financed by a generous research grant 
by the Austrian Federal Government, started in early 1999 and is intended to 
last at least six years. The research is truly inter-disciplinary, involving both 
musicologists and AI researchers. We also expect to contribute new results to 
both disciplines involved, and our first experimental results show that this is 
realistic — for instance, in [26] both a new, general rule learning algorithm and 
some interesting, novel musical discoveries are presented. 

In recent years, there has been an increasing number of attempts, in the 
field of empirical musicology, to formulate quantitative, mathematical or com- 
putational models of (aspects of) expressive performance (e.g., [1, 12, 13, 16-21]). 
This work has produced a wealth of detailed hypotheses and insights, but has of- 
ten been based on rather limited sets of performance data (which were sometimes 
also produced under ‘laboratory conditions’). What distinguishes our project is 
the use of large amounts of ‘real-world’ data, and the application of inductive
learning methods to discover interesting and possibly novel patterns and
regularities in the data. In short, we aim at performing the most data-intensive
investigations ever done in musical expression research. 

The purpose of the present paper is to give an overview of the project and 
its current state, and to discuss the challenges that this application problem 
presents to machine learning and knowledge discovery. In section 2, we explain 
the basic notions of expressive music performance. Section 3 sketches the general 
research framework of the project and briefly touches upon the enormous diffi- 
culties involved in data acquisition and preparation (an aspect often neglected in 
machine learning publications). Section 4 looks at the problem from a machine 
learning point of view and discusses some of the particular challenges posed by 
the complex nature of the target phenomenon. Section 5 briefly summarizes some 
interesting results obtained so far and talks about some of our ongoing research. 



2 Expressive Music Performance 

When played exactly as notated in the musical score, a piece of music would 
sound utterly mechanical and lifeless; it is both unmusical and physically im- 
possible for a musician to perform a piece with perfectly constant tempo, even 
loudness, etc. What makes a piece of music come alive (and what makes some 
performers famous) is the art of music interpretation, that is, the artist’s un- 
derstanding of the structure and ‘meaning’ of a piece of music, and his/her 
(conscious or unconscious) expression of this understanding via expressive per- 
formance: a performer shapes a piece by continuously varying important param- 
eters like tempo, dynamics (loudness), articulation, etc., speeding up at some 
places, slowing down at others, stressing certain notes or passages by various 
means, and so on. It is this shaping that can turn a lifeless piece of music into a 
moving experience, and that also makes both the composer’s and the performer’s 
ideas clear to the listener. What types of parameters are at a performer’s disposal 
partly depends on the instrument being played, but the most important dimen- 
sions are tempo and timing, dynamics (variations in loudness), and articulation 
(basically, the way successive notes are connected). 

Expressive music performance plays a central role in our current musical cul- 
ture, and musicologists are showing increased interest in understanding exactly 
what it is that artists do when they play music. Are there explainable and quan- 
tifiable principles that govern expressive performance? To what extent and how 
are ‘acceptable’ performances determined by the (structure of the) music? What 
are the cognitive principles that govern the production (in the performer) and 
the perception (in the listener) of expressive performances? And what does this 
have to do with how we experience music? 

Our project hopes to contribute to answering the first two of these questions. 
We collect precise measurements of performances by skilled musicians, and try 
to detect patterns and regularities (and intelligible characterizations of these) 
via inductive learning. As we also enable the computer to recognize structural
aspects of the music, potential relationships between expressive patterns and
musical structure should emerge naturally from these investigations. 

This approach is based on earlier work by the author [23,24], where it was 
shown that given some knowledge about musical structure, a computer can in- 
deed learn general performance rules that produce rather sensible ‘interpreta- 
tions’ of musical pieces. The central problem with these early studies was a lack 
of real performance data (the investigations were based largely on performances 
by the author himself). In our current work, we go beyond this by working with
large collections of performances by skilled musicians, recorded on special instru- 
ments (pianos) that precisely measure and record each action of the performer. 
Ideally, we would also like to study the performance style of famous artists, on 
the basis of, e.g., audio CDs, but that will depend on the availability of compu- 
tational methods for precise musical information extraction from audio, which 
is still an open problem in signal processing. 

3 The Project: A High-level View 

To give the reader an impression of the complexity of such a ‘real-world’ knowledge
discovery project, Fig. 1 sketches the overall structure of our approach.
As explained above, the basic goal is to take recordings of pieces as played by 
musicians, measure the ‘expressive’ aspects (e.g., tempo fluctuations) in these, 
and apply some machine learning algorithms to these measurements in order to 
induce general, predictive models of various aspects of expressive performance 
(e.g., a set of classification or regression rules that predict the tempo deviations 
a pianist is likely to apply to a given piece). These models must then be val- 
idated, e.g., by comparing them to theories in the musicological literature, by 
applying them to new pieces and analysing the musical quality of the resulting 
computer-generated performances, and, of course, by measuring their
generalization accuracy on unseen data. All this is sketched in the lower half of Fig. 1.

However, the story is much more complex. The problems involved in acquiring 
and pre-processing the data turned out to be formidable and forced us to develop 
a whole range of novel music analysis algorithms. And since we spent so much 
effort on these issues, I take the liberty of at least briefly mentioning them here. 



3.1 Data Acquisition 

The first problem was obtaining high-quality performances by human musicians 
(e.g., pianists) in machine-readable form. There are currently no signal process- 
ing algorithms that can extract the precise details of a performance from audio 
signals, so we cannot use sound recordings (e.g., audio CDs) as a data source. 
Our current source of information is the Boesendorfer SE290, a high-class concert 
grand piano that precisely measures every key, hammer, and pedal movement 
and records these measurements in a symbolic form similar to MIDI (though with 
higher precision). We did eventually manage to get large sets of performances 
that had been recorded on this instrument by a number of excellent pianists. 
For instance, we currently have performances of 17 complete piano sonatas by 
W.A. Mozart as played by a highly skilled concert pianist. This data set corresponds 
to some 5 1/2 hours of music and contains around 150,000 notes. We also 
have performances, by a famous Russian pianist, of essentially the entire piano 
works by Frederic Chopin (more than 9 hours of music, 300,000 notes, 2 million 
pedal measurements). This is a huge amount of data indeed; in fact, it is by far 
the largest collection of detailed performance measurements that has ever been 
compiled and studied in expression research. 

Fig. 1. The research framework: a sketch of data processing/analysis steps. 

Another line of current research, which cannot be discussed here, concerns 
the extraction of performance information directly from digital audio data, e.g., 
audio CDs [9] (see top left corner in Fig. 1). This will eventually allow us to also 
study at least certain limited aspects of expression in arbitrary recordings by 
famous artists. 



3.2 Data Preprocessing 

Preprocessing these data to make them usable for analysis and machine learning 
is a formidable task. What we need is not only the performances (i.e., informa- 
tion about how the notes were played), but also the notated music score (i.e., 
information about how the notes ‘should be’ played) and the exact note-to-note 
correspondence between the two. Manually coding musical scores consisting of 
tens of thousands of notes is not feasible; in order to get at the scores, we had 
to develop computational methods for extracting (re-constructing) the score in- 
formation from the expressive performances themselves. The result is a whole 
range of new algorithms for music analysis problems like beat induction and 
tempo tracking, i.e., inferring the metrical structure of the piece in the face of 
(sometimes rather extreme) tempo changes [10, 11]; quantization, i.e., inferring 
the ‘intended’ onset times and durations of notes in the underlying score [2]; and 
inducing the correct enharmonic spelling of notes (e.g., G♯ vs. A♭) [3], which is 
not merely an aesthetic issue, but absolutely vital for the correct interpretation 
of a musical passage. 
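
To make the quantization problem concrete, here is a minimal sketch. It is not the algorithm of [2]; it only assumes that beat times have already been estimated by a beat tracker and snaps each performed onset to the nearest position on a metrical grid within its beat. The function name, the grid resolution, and the clamping behaviour are illustrative assumptions.

```python
from bisect import bisect_right


def quantize_onsets(onsets, beat_times, divisions=4):
    """Map performed onset times (seconds) to score positions (beats).

    beat_times: strictly increasing times (seconds) of successive beats.
    divisions:  grid positions per beat (4 = sixteenths if the beat is a quarter).
    Onsets outside the tracked beats are clamped to the first or last beat interval.
    """
    positions = []
    for t in onsets:
        # Index of the beat interval that contains t.
        i = bisect_right(beat_times, t) - 1
        i = max(0, min(i, len(beat_times) - 2))
        start, end = beat_times[i], beat_times[i + 1]
        # Fractional position inside the beat, which follows the local tempo.
        frac = (t - start) / (end - start)
        frac = max(0.0, min(frac, 1.0))
        positions.append(i + round(frac * divisions) / divisions)
    return positions
```

Because the grid is stretched between successive beats, the same code copes with a slowing or accelerating performance, provided the beat tracker has followed the tempo change.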

The ‘raw’ score files extracted by these algorithms from the performance data 
(up to now, some 150,000 lines of text) still needed to be manually corrected 
and further annotated. And finally, the resulting score files were matched, in a 
semi-automatic process, with the performance files to establish the exact note- 
to-note correspondence; thousands of notes were manually identified and labelled 
as missing or extraneous (most of these are related to ornaments like trills etc.). 
From this information we could then finally compute all the detailed aspects of 
a performer’s expressive playing (e.g., tempo changes, articulation details etc.) 
that serve as training data in the inductive learning process. 
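
As an illustration of what such a computation might look like, the following sketch derives a local tempo value and an articulation ratio for each melody note from matched score and performance data. The record layout (`MatchedNote`) and the particular definitions of tempo and articulation are assumptions made for this example, not the project's actual data format.

```python
from dataclasses import dataclass


@dataclass
class MatchedNote:
    """One melody note after score-performance matching (hypothetical layout)."""
    score_onset: float     # notated onset, in beats
    score_duration: float  # notated duration, in beats
    perf_onset: float      # performed onset, in seconds
    perf_offset: float     # performed key release, in seconds


def expression_curves(notes):
    """Return per-note (tempo in beats per minute, articulation ratio).

    Values are computed for every note except the last, which has no
    following onset to define an inter-onset interval (IOI).
    Assumes a monophonic melody with strictly increasing onsets.
    """
    tempo, articulation = [], []
    for cur, nxt in zip(notes, notes[1:]):
        score_ioi = nxt.score_onset - cur.score_onset  # beats
        perf_ioi = nxt.perf_onset - cur.perf_onset     # seconds
        sec_per_beat = perf_ioi / score_ioi
        tempo.append(60.0 / sec_per_beat)
        # Sounding duration relative to the notated duration at the local tempo:
        # about 1.0 is legato, clearly below 1.0 is detached playing.
        notated_sec = cur.score_duration * sec_per_beat
        articulation.append((cur.perf_offset - cur.perf_onset) / notated_sec)
    return tempo, articulation
```

The two returned lists are the kind of per-note curves that a learner would then try to predict from features of the score.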



3.3 Enhancing the Data: Musical Structure Analysis 

The next problem concerns the representation of the music. What we are search- 
ing for are systematic connections between the structure of the music (e.g., har- 
monic, metrical, and phrase structure) and patterns in the performances (e.g., 
a gradual rise in loudness (crescendo) over a given phrase). The representation 
of the musical pieces must therefore be extended with an explicit description of 
certain structural aspects. Again, a complete manual analysis of a large number 
of complex pieces is infeasible or at least highly impractical, so there is a need 
for computational methods. In the context of our project, we have developed a 
number of new music analysis algorithms that make explicit different structural 
aspects of a piece such as its segment structure [5], categories of melodic motifs 
and their recurrence [6], and various types of common melodic, harmonic, and 
rhythmic patterns, as postulated by music theorists [15]. These algorithms are of 
general interest to musicology, as they constitute formal computational models 
of aspects of musical structure understanding that had not hitherto been suffi- 
ciently formalized in music theory. The analyses computed by these tools can be 
used as additional descriptors in the representation of musical pieces. 

4 Challenges to Machine Learning 

The result of all these efforts is training data as exemplified in Fig. 2, which 
shows the dynamics and tempo deviations extracted from performances, by three 
different pianists, of a well-known piece by Frederic Chopin. For the moment, we 
restrict our attention to how the melodies of the pieces are played (and neglect 
more complex aspects like interactions between different voices of a piece). All 
three expression dimensions that we are currently studying (tempo/timing, 
dynamics, and articulation) can then be represented as curves that associate 
a particular tempo, loudness, or relative duration value with each melody note. 
Fig. 2 also contains a number of annotations added by the author to highlight 
different structural aspects of both the piece itself, and the performances. These 
will be of help in the following discussion. 

The first question one might ask is: is there something to be learned at all? 
Isn’t expressive performance something intangible, something that reflects the 
artistic uniqueness of a performer and thus necessarily escapes any attempt at 
formalisation or explanation? A look at Fig. 2 reassures us: there are, of course, 
individual differences in the interpretations by the three pianists, but there are 
also very clear commonalities in the three curves. In other words, there seem to 
be some strong common principles at work that lead performers to do things 
in a similar way. And these common performance patterns must somehow be 
determined by the structure of the music being played. 

In fact, this situation affords opportunities for at least two different types of 
learning. The one that better fits the ‘traditional’ inductive learning setting is 
learning to characterise and predict the commonalities between performances. 
In the simplest case, a learner that is given different performances of the same 
piece can be expected to find descriptions of those patterns that are common 
to most of the performances, and treat those situations where the individual 
performers differ as noise.* Characterising common performance patterns that 
point to some fundamental underlying principles is indeed the primary goal of 
our project. But it would also be interesting to try to learn about characteristic 
differences between individual artists. Here, the problem is not to find out where 
two performers differ — that is directly obvious from the data — but to find 
classes of situations in which there is a systematic and explicitly characterisable 
difference in behaviour. This might be a novel problem for machine learning. 

A related question is how much we can expect to learn and formalise. Clearly, 
we cannot expect artists to be entirely predictable. We will have to make do with 
models that explain only a (possibly small) fraction of the observed phenomena. 
This requirement favours learners that, rather than trying to cover all of the 
instance space, can focus on those subspaces where something can be learned 
and produce models that clearly indicate when something is outside their area 
of expertise. Some interesting results along these lines are reported in [26]. 
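
A partial model of this kind can be assessed with two separate numbers: how much of the data it makes a prediction for at all (coverage), and how often it is right where it does predict (precision). The sketch below is a generic illustration of that idea, not the evaluation procedure of [26]; the rule format, the feature name `duration_ratio`, and the toy examples are invented for the purpose.

```python
def evaluate_partial_model(rules, examples):
    """Evaluate a rule list that may abstain.

    rules:    list of (condition, predicted_label); the first matching rule fires.
    examples: list of (features, true_label).
    Returns (coverage, precision): the fraction of examples for which some rule
    fired, and the fraction of those predictions that were correct.
    """
    covered = correct = 0
    for features, label in examples:
        for condition, predicted in rules:
            if condition(features):
                covered += 1
                if predicted == label:
                    correct += 1
                break  # examples matched by no rule are left unpredicted
    coverage = covered / len(examples) if examples else 0.0
    precision = correct / covered if covered else 0.0
    return coverage, precision


# Toy data, invented purely for illustration.
rules = [(lambda note: note["duration_ratio"] >= 2.0, "lengthen")]
examples = [
    ({"duration_ratio": 2.0}, "lengthen"),
    ({"duration_ratio": 3.0}, "shorten"),
    ({"duration_ratio": 1.0}, "lengthen"),
    ({"duration_ratio": 0.5}, "shorten"),
]
coverage, precision = evaluate_partial_model(rules, examples)
```

Reporting the two figures separately keeps a model that says little but says it reliably from being penalised for the cases on which it deliberately stays silent.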

* ‘Real’ noise (in the sense of mistakes or inaccuracies by the performer) is not much 
of a problem — high-class pianists are extremely precise, both in terms of motor 
control and in terms of their memory and capacity to reproduce particular expressive 
patterns over repeated performances. 

Another fundamental question is: what are the target concepts? And that relates 
to a number of deep problems concerning representation, abstraction level, 
and context. At first sight, the curves in Fig. 2 are reminiscent of time series, 
which suggests the use of methods from time series analysis and forecasting. 
However, this is an inappropriate view. It is not so much the past states that de- 
termine how a curve is going to continue into the future; it is the structure of the 
underlying musical piece that partly determines what ‘shapes’ or ‘envelopes’ (in 
tempo, dynamics, etc.) a performer will apply to the music. The question then 
arises as to what exactly the scope of these ‘shapes’ is, and what the structural 
units in the music are to which these ‘shapes’ are applied — in other words, 
what is the appropriate abstraction level? 

Actually, musical expression is a multi-level phenomenon. Good performances 
exhibit structure at several levels. Local deviations expressing detailed nuances 
(e.g., the stressing of a particular note) will be embedded in more extended, 
higher-level expressive shapes, such as a general accelerando-ritardando (speed- 
ing up - slowing down) over an entire phrase. For instance, the expression curves 
in Fig. 2 exhibit both local, note-level (see notes marked by asterisks) and 
more global structural patterns (e.g., a clear crescendo-decrescendo applied to the 
medium-level phrase A.1 (dynamics curve), and an ever so slight accelerando-ritardando 
over phrase A in the tempo dimension). Thus, it will be necessary 
to learn models at different structural abstraction levels, which introduces the 
additional problems of discerning and separating multiple pattern levels in given 
training observations, and of combining learned models of different granularity at 
prediction time. Moreover, apart from the note and the phrase levels, there may 
be other, intermediate structural units relevant to explaining certain aspects of 
the curves. Discovering these is an intriguing musicological problem. One of our 
plans here is to study the utility of new substructure discovery algorithms [8]. 
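
One simple way to picture the separation of pattern levels is to fit a smooth shape to each phrase and to treat what remains as note-level deviation. The sketch below does this with a least-squares parabola per phrase. The choice of a parabola, the function names, and the assumption that phrase boundaries are given as index ranges are all illustrative; this is not presented as the approach used in the project.

```python
def fit_parabola(xs, ys):
    """Least-squares fit of y = a*x^2 + b*x + c; returns (a, b, c)."""
    # Power sums that make up the normal equations.
    s = [sum(x ** k for x in xs) for k in range(5)]
    t = [sum(y * x ** k for x, y in zip(xs, ys)) for k in range(3)]
    m = [[s[4], s[3], s[2]],
         [s[3], s[2], s[1]],
         [s[2], s[1], s[0]]]
    rhs = [t[2], t[1], t[0]]

    def det3(a):
        return (a[0][0] * (a[1][1] * a[2][2] - a[1][2] * a[2][1])
                - a[0][1] * (a[1][0] * a[2][2] - a[1][2] * a[2][0])
                + a[0][2] * (a[1][0] * a[2][1] - a[1][1] * a[2][0]))

    d = det3(m)
    coeffs = []
    for col in range(3):  # Cramer's rule, one column at a time
        mc = [row[:] for row in m]
        for r in range(3):
            mc[r][col] = rhs[r]
        coeffs.append(det3(mc) / d)
    return tuple(coeffs)


def decompose(curve, phrases):
    """Split an expression curve into phrase-level shape and note-level residual.

    phrases: list of (start, end) index pairs, end exclusive, each covering
             at least three notes. Notes outside any phrase get shape 0.0.
    """
    shape = [0.0] * len(curve)
    for start, end in phrases:
        n = end - start
        xs = [i / (n - 1) for i in range(n)]  # position within the phrase, 0..1
        a, b, c = fit_parabola(xs, curve[start:end])
        for i, x in enumerate(xs):
            shape[start + i] = a * x * x + b * x + c
    residual = [v - s for v, s in zip(curve, shape)]
    return shape, residual
```

Note that a least-squares fit absorbs part of any local emphasis into the phrase shape, so the residual understates strong note-level deviations; this is one small instance of the separation problem described above.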

Generally, the representation problem is a non-trivial one. There are many 
conceptual frameworks in which music can be described. Finding the most ap- 
propriate music-structural descriptors is a question of musicological interest. 
Systematic experimentation with different music-theoretic vocabularies will be 
necessary to identify these. In addition, the representation should capture the 
relevant context of notes and musical structures, which is a tricky issue not 
only because we do not know exactly how large this context should be, but 
also because there are some highly non-local effects at work (e.g., when 
the recurrence of a melodic motif prompts the performer to ‘fall back’ into a 
previous pattern). As for the essentially relational nature of music, which would 
suggest the use of first-order logic for knowledge representation and Inductive 
Logic Programming for learning, it will be a matter of experimentation to study 
the trade-off between the increase in expressive power and the increase in search 
complexity implied by the use of ILP algorithms (see [14]). 

Another interesting observation, which may be a source of new learning problems, 
is that the different target dimensions are very likely to interact or be 
inter-dependent. The performances in Fig. 2 exhibit some clear parallels between 
dynamics and tempo, particularly in the case of some local deviations. For instance, 
there seems to be a strong correlation between dynamic emphasis and 
individual note lengthening (see the events marked by asterisks in Fig. 2). At 
a higher level, one could construe a certain parallelism between the dynamics 
and tempo shapes of the second of the high-level phrases (B) (see the arcs in 
the dynamics and tempo plots), which would confirm a general hypothesis by 
musicologists.^ In general, performers have different means of stressing musical 
passages, by combining timing, dynamics, and articulation in certain ways. This 
suggests that expressive performance might be an ideal candidate domain for 
multi-task learning [7], where multiple learning tasks are pursued in parallel us- 
ing a shared representation, which presumably enables the learner to transfer 
information between different related problems. Moreover, we would be inter- 
ested in an explicit characterisation of the connection between, say, timing and 
dynamics, if there is one. This seems to be a new type of learning problem. 
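
A first, purely descriptive check of such a connection is the linear correlation between two per-note curves, for example loudness and relative note lengthening. The sketch below computes the Pearson coefficient from scratch; it only quantifies co-variation and is in no way the explicit characterisation asked for above. The function name is an assumption of this example.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equally long sequences with non-zero variance."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)
```

Applied to a loudness curve and a lengthening curve over the same melody notes, a value near +1 would indicate that louder notes also tend to be lengthened.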

And finally, there is the evaluation problem. How is one to evaluate and quan- 
tify the validity of a given theory in a domain where there is no unique ‘correct’ 
solution (there are usually many ‘acceptable’ ways of performing a piece)? The 
empirical evaluation methods used in machine learning (measuring classification 
accuracy and prediction error on unseen data, estimating true error via cross- 
validation etc.) do have their place here, but they need to be complemented with 
more music-specific methods that, while avoiding judgments concerning 
the musical or aesthetic quality of a performance, do account for musical aspects 
of a model’s predictions. This is a challenging research question for musicology 
and is beyond the scope of the present paper. 

5 First Results and Ongoing Research 

It is only rather recently that we have begun to perform systematic learning 
experiments with the huge data collections mentioned in section 3.1, so most 
of the above questions and challenges are still open. Our investigations so far 
have mostly concentrated on the note level, i.e., on describing and predicting 
how individual notes will be played, given various features of the notes and their 
immediate context. Here is a brief list of the most interesting results so far: 

Basic learnability: In a first suite of experiments [25], we succeeded in show- 
ing that even at the level of individual notes, there is structure that can 
be learned. Standard inductive learners managed to predict the performer’s 
choices with better than chance probability. Extended feature selection experiments 
showed that different sets of music-theoretic descriptors are relevant 
for different expressive dimensions (timing, dynamics, articulation). 

New rule learning algorithm: Based on experiences gathered in these initial 
investigations, we developed a new rule learning algorithm named PLCG 
that can find simple partial theories in complex data where neither high 
coverage nor high precision can be expected. The PLCG algorithm and some 
experiments with it are described in more detail elsewhere in this volume [26]. 

^ In fact, this parallelism becomes clearer once certain local distortions and artifacts in 
the expression curves (caused, e.g., by the grace notes in bars 7 and 8) are removed. 

Partial note-level rule model: PLCG has discovered a number of surprisingly 
simple and surprisingly general and robust^ note-level expression principles 
[27, 28]. These rules are currently being investigated more closely from a 
musicological perspective; some of them will probably form the nucleus of a 
quantitative rule-based model of note-level timing and articulation. 

Learning at higher structural levels: In some limited earlier studies [23, 
24], we had already found indications that learning at multiple structural 
levels does indeed improve the results (and the musical quality of the result- 
ing computer-generated performances) considerably. However, the definitions 
of these higher musical levels and particularly the methods for combining 
learned theories of different granularity were very ad hoc, and the training 
material was extremely limited. We are currently developing a more princi- 
pled approach. 

Discovering stylistic differences: Regarding the possibility of discovering 
stylistic differences between different performers, we had obtained first in- 
direct positive evidence in an early experiment that involved performances 
of the same piece by both the famous Vladimir Horowitz and a number of 
advanced piano students [22]. There it turned out that rules learned from 
Horowitz yielded a significantly higher predictive accuracy on other Horowitz 
data than on the student data, and vice versa. Recently, we have started new 
focussed investigations on this issue, with the aim of finding characterisations 
of these differences. This can be done with standard inductive rule learning 
algorithms, but requires the design of a different type of learning scenario. 
In a small initial experiment, several interesting rules were discovered that 
might describe characteristic differences in behaviour between the two great 
pianists Alfred Cortot and V. Horowitz. But the data were much too limited 
to permit general conclusions. We are now planning to repeat this type of 
experiment with a much larger data set. 

Machine learning for structural music analysis: And finally, computa- 
tional music research offers many other opportunities for machine learning 
that are not necessarily related to the performance issue itself. There are 
many problems in automated structural music analysis for which there are 
as yet no reliable algorithms (e.g., harmonic analysis, phrase structure anal- 
ysis, etc.) and which could benefit from inductive learning. For instance, 
we have developed an algorithm for finding classes of musical motifs and 
for elucidating the motivic structure of a piece, based on a new clustering 
method. This algorithm has been shown to be capable of reproducing mo- 
tivic analyses by human musicologists of such complex pieces as Schumann’s 
Träumerei and Debussy’s Syrinx [6], and of predicting the categorizations 
made by human listeners [4]. 

^ For instance, four simple timing rules turn out to be sufficient for correctly predicting 
more than 20% of a pianist’s local ritardandi, and these rules seem to generalize well 
to music of different styles. 

Obviously, these are just first steps in a long research journey that should 
take us closer to our final goal: a quantitative, composite computational theory 
that explains as much as possible of the various dimensions of expressive music 
performance, and the interactions between them. This journey will force us to 
address a number of novel machine learning problems on the way. This is a long-term 
undertaking, and we would like to extend an invitation to motivated young 
researchers to join our project team and work with us towards this goal. 



Acknowledgements 

The project is made possible by a very generous START Research Prize by 
the Austrian Federal Government, administered by the Austrian Fonds zur 
Förderung der Wissenschaftlichen Forschung (FWF) (project no. Y99-INF). Additional 
support for our research on machine learning and music is provided 
by the European project HPRN-CT-2000-00115 (MOSART). The Austrian Re- 
search Institute for Artificial Intelligence acknowledges basic financial support 
by the Austrian Federal Ministry for Education, Science, and Culture. I would 
like to thank my colleagues Emilios Cambouropoulos, Simon Dixon, and Werner 
Goebl for their cooperation and many fruitful and enjoyable discussions. 



References 

1. Bresin, R. (2000). Virtual Virtuosity: Studies in Automatic Music Performance. 
Doctoral Dissertation, Royal Institute of Technology (KTH), Stockholm, Sweden. 

2. Cambouropoulos, E. (2000). From MIDI to Traditional Musical Notation. In Proceedings 
of the AAAI’2000 Workshop on Artificial Intelligence and Music, 17th 
National Conference on Artificial Intelligence (AAAI’2000), Austin, TX. Menlo 
Park, CA: AAAI Press. 

3. Cambouropoulos, E. (2001). Automatic Pitch Spelling: From Numbers to Sharps 
and Flats. In Proceedings of the 8th Brazilian Symposium on Computer Music, 
Fortaleza, Brazil. 

4. Cambouropoulos, E. (2001). Melodic Cue Abstraction, Similarity, and Category 
Formation: A Formal Model. Music Perception, 18(3) (in press). 

5. Cambouropoulos, E. (2001). The Local Boundary Detection Model (LBDM) and 
its Application in the Study of Expressive Timing. In Proceedings of the Interna- 
tional Computer Music Conference (ICMC’2001). San Francisco, CA: International 
Computer Music Association. 

6. Cambouropoulos, E. and Widmer, G. (2000). Automatic Motivic Analysis via 
Melodic Clustering. Journal of New Music Research, 29(4) (in press). 

7. Caruana, R. (1997). Multitask Learning. Machine Learning 28(1), 41-75. 

8. De Raedt, L. & Kramer, S. (2001). The Levelwise Versionspace Algorithm and its 
Application to Molecular Fragment Finding. In Proceedings of the 17th Interna- 
tional Joint Conference on Artificial Intelligence (IJCAI-01), Seattle, WA. 

9. Dixon, S. (2000). Extraction of Musical Performance Parameters from Audio Data. 
In Proceedings of the First IEEE Pacific-Rim Conference on Multimedia (PCM 
2000), Sydney, Australia. 





10. Dixon, S. (2001). Automatic Extraction of Tempo and Beat from Expressive Per- 
formances. Journal of New Music Research (in press). 

11. Dixon, S. and Cambouropoulos, E. (2000). Beat Tracking with Musical Knowledge. 
In Proceedings of the 14th European Conference on Artificial Intelligence (ECAI-2000), 
Berlin. IOS Press, Amsterdam. 

12. Friberg, A. (1995). A Quantitative Rule System for Musical Performance. Ph.D. 
dissertation. Department of Speech Communication and Music Acoustics, Royal 
Institute of Technology (KTH), Stockholm. 

13. Friberg, A., Bresin, R., Frydén, L., and Sundberg, J. (1998). Musical Punctuation 
on the Microlevel: Automatic Identification and Performance of Small Melodic 
Units. Journal of New Music Research 27(3), 271-292. 

14. Kramer, S. (1999). Relational Learning vs. Propositionalization. Investigations in 
Inductive Logic Programming and Propositional Machine Learning. Ph.D. thesis. 
Technical University of Vienna. 

15. Narmour, E. (1992). The Analysis and Cognition of Melodic Complexity: The 
Implication-Realization Model. Chicago, IL: University of Chicago Press. 

16. Palmer, C. (1988). Timing in Skilled Piano Performance. Ph.D. Dissertation, Cor- 
nell University. 

17. Repp, B. (1992). Diversity and Commonality in Music Performance: An Analysis 
of Timing Microstructure in Schumann’s ‘Träumerei’. Journal of the Acoustical 
Society of America 92(5), 2546-2568. 

18. Shaffer, L.H. (1980). Analyzing Piano Performance: A Study of Concert Pianists. 
In G. Stelmach and J. Requin (eds.). Tutorials in Motor Behavior. Amsterdam: 
North-Holland. 

19. Sundberg, J., Friberg, A., and Frydén, L. (1991). Common Secrets of Musicians and 
Listeners: An Analysis-by-Synthesis Study of Musical Performance. In P. Howell, R. 
West & I. Cross (eds.). Representing Musical Structure. London: Academic Press. 

20. Todd, N. (1989). Towards a Cognitive Theory of Expression: The Performance and 
Perception of Rubato. Contemporary Music Review, vol. 4, pp. 405-416. 

21. Todd, N. (1992). The Dynamics of Dynamics: A Model of Musical Expression. 
Journal of the Acoustical Society of America 91, pp. 3540-3550. 

22. Widmer, G. (1996). What Is It That Makes It a Horowitz? Empirical Musicology 
via Machine Learning. In Proceedings of the 12th European Conference on Artificial 
Intelligence (ECAI-96), Budapest. Wiley & Sons, Chichester, UK. 

23. Widmer, G. (1996). Learning Expressive Performance: The Structure-Level Ap- 
proach. Journal of New Music Research 25(2), pp. 179-205. 

24. Widmer, G. (1998). Applications of Machine Learning to Music Research: Empir- 
ical Investigations into the Phenomenon of Musical Expression. In R.S. Michalski, 
I. Bratko and M. Kubat (eds.). Machine Learning, Data Mining and Knowledge 
Discovery: Methods and Applications. Chichester, UK: Wiley & Sons. 

25. Widmer, G. (2000). Large-scale Induction of Expressive Performance Rules: First 
Quantitative Results. In Proceedings of the International Computer Music Conference 
(ICMC’2000). San Francisco, CA: International Computer Music Association. 

26. Widmer, G. (2001). Discovering Strong Principles of Expressive Music Performance 
with the PLCG Rule Learning Strategy. In Proceedings of the 12th European Conference 
on Machine Learning (ECML’01), Freiburg. Berlin: Springer Verlag. 

27. Widmer, G. (2001). Inductive Learning of General and Robust Local Expres- 
sion Principles. In Proceedings of the International Computer Music Conference 
(ICMC’2001). San Francisco, CA: International Computer Music Association. 

28. Widmer, G. (2001). Machine Discoveries: Some Simple, Robust Local Expression 
Principles. Submitted. 




Scalability, Search, and Sampling: From Smart 
Algorithms to Active Discovery 



Stefan Wrobel 

Otto-von-Guericke-Universität Magdeburg 
School of Computer Science, IWS 
Knowledge Discovery and Machine Learning Group 
http://kd.cs.uni-magdeburg.de 
P.O.Box 4120, Universitätsplatz 2 
39016 Magdeburg, Germany 
wrobel@iws.cs.uni-magdeburg.de 



Abstract. The focus on scalability to very large datasets has been a dis- 
tinguishing feature of the KDD endeavour right from the start of the area. 
In the present stage of its development, the field has begun to seriously 
approach the issue, and a number of different techniques for scaling up 
KDD algorithms have emerged. Traditionally, such techniques concentrate 
on the search aspects of the problem, employing algorithmic 
techniques to avoid searching parts of the space or to speed up processing 
by exploiting properties of the underlying host systems. Such techniques 
guarantee perfect correctness of solutions, but can never reach sublinear 
complexity. In contrast, researchers have recently begun to take a fresh 
and principled look at stochastic sampling techniques which give only an 
approximate quality guarantee, but can make runtimes almost indepen- 
dent of the size of the database at hand. In the talk, we give an overview 
of both of these classes of approaches, focusing on individual examples 
from our own work for more detailed illustrations of how such techniques 
work. We briefly outline how active learning elements may enhance KDD 
approaches in the future. 



L. De Raedt and A. Siebes (Eds.): PKDD 2001, LNAI 2168, p. 507, 2001. 
© Springer-Verlag Berlin Heidelberg 2001 